Abstract

In this paper we study the problem of adaptive estimation of a multivariate function 
satisfying some structural assumption. We propose a novel estimation procedure that 
adapts simultaneously to unknown structure and smoothness of the underlying function. 
The problem of structural adaptation is stated as the problem of selection from a given 
collection of estimators. We develop a general selection rule and establish for it global 
oracle inequalities under arbitrary $L_p$-losses. These results are applied to adaptive
estimation in the additive multi-index model.

Short Title: Structural adaptation via oracle inequalities 

Keywords: structural adaptation, oracle inequalities, minimax risk, adaptive estimation, 
optimal rates of convergence 

2000 AMS Subject Classification : 62G05, 62G20 

1 Introduction 

1.1 Motivation 

In this paper we study the problem of minimax adaptive estimation of an unknown function
$F : \mathbb{R}^d \to \mathbb{R}$ in the multidimensional Gaussian white noise model

$$Y(dt) = F(t)\,dt + \varepsilon W(dt), \quad t = (t_1, \ldots, t_d) \in \mathcal{D}, \tag{1}$$

where $\mathcal{D} \supset [-1/2, 1/2]^d$ is an open interval in $\mathbb{R}^d$, $W$ is the standard Brownian sheet in
$\mathbb{R}^d$ and $0 < \varepsilon < 1$ is the noise level. Our goal is to estimate the function $F$ on the set
$\mathcal{D}_0 := [-1/2, 1/2]^d$ from the observation $\{Y(t), t \in \mathcal{D}\}$. We consider the observation set
$\mathcal{D}$, which is larger than $\mathcal{D}_0$, in order to avoid discussion of boundary effects. We would like
to emphasize that such assumptions are rather usual in multivariate models, see, e.g., Hall
(1989) and Chen (1991).

To measure performance of estimators, we will use the risk function determined by the
$L_p$-norm $\|\cdot\|_p$, $1 \le p \le \infty$, on $\mathcal{D}_0$: for $F : \mathbb{R}^d \to \mathbb{R}$, $0 < \varepsilon < 1$, and for an arbitrary estimator
$\tilde F$ based on the observation $\{Y(t), t \in \mathcal{D}\}$ we consider the risk

$$\mathcal{R}_p[\tilde F; F] = \mathbb{E}_F \|\tilde F - F\|_p.$$

Here and in what follows $\mathbb{E}_F$ denotes the expectation with respect to the distribution
of the observation $\{Y(t), t \in \mathcal{D}\}$ satisfying (1).

We will suppose that $F \in \mathbb{G}_s$, where $\{\mathbb{G}_s, s \in S\}$ is a collection of functional classes
indexed by $s \in S$. The choice of this collection is a delicate problem, and below we discuss
it in detail.

For a given class $\mathbb{G}_s$ we define the maximal risk

$$\mathcal{R}_p[\tilde F; \mathbb{G}_s] = \sup_{F \in \mathbb{G}_s} \mathcal{R}_p[\tilde F; F], \tag{2}$$

and study asymptotics (as the noise level $\varepsilon$ tends to 0) of the minimax risk

$$\inf_{\tilde F} \mathcal{R}_p[\tilde F; \mathbb{G}_s],$$

where $\inf_{\tilde F}$ denotes the infimum over all estimators of $F$. At this stage, we suppose that
the parameter $s$ is known, and therefore the functional class $\mathbb{G}_s$ is fixed. In other words,
we are interested in minimax estimation of $F$. The important remark in this context
is that the minimax rate of convergence $\varphi_\varepsilon(s)$ on $\mathbb{G}_s$ (the rate which satisfies
$\varphi_\varepsilon(s) \asymp \inf_{\tilde F} \mathcal{R}_p[\tilde F; \mathbb{G}_s]$) as well as the estimator attaining this rate (called the rate optimal estimator
in the asymptotic minimax sense) depend on the parameter $s$. This dependence restricts application
of the minimax approach in practice. Therefore, our main goal is to construct an estimator
which is independent of $s$ and achieves the minimax rate $\varphi_\varepsilon(s)$ simultaneously for all $s \in S$.
Such an estimator, if it exists, is called optimally adaptive on $S$.

Let us now discuss the choice of the collection $\{\mathbb{G}_s, s \in S\}$. It is well known that the
main difficulty in estimation of multivariate functions is the curse of dimensionality: the
best attainable rate of convergence of estimators becomes very slow as the dimensionality
grows. To illustrate this effect, suppose, for example, that the underlying function $F$ belongs
to $\mathbb{G}_s = \mathbb{H}_d(\alpha, L)$, $s = (\alpha, L)$, $\alpha > 0$, $L > 0$, where $\mathbb{H}_d(\alpha, L)$ is an isotropic Hölder ball of
functions. We give the exact definition of this functional class later. Here we only mention
that $\mathbb{H}_d(\alpha, L)$ consists of functions $g$ with bounded partial derivatives of order $\le \lfloor\alpha\rfloor$ and
such that, for all $x, y \in \mathcal{D}$,

$$|g(y) - P_g(x, y - x)| \le L|x - y|^{\alpha},$$

where $P_g(x, y - x)$ is the Taylor polynomial of order $\le \lfloor\alpha\rfloor$ obtained by expansion of $g$
around the point $x$, and $|\cdot|$ is the Euclidean norm in $\mathbb{R}^d$. The parameter $\alpha$ characterizes the
isotropic (i.e., the same in each direction) smoothness of the function $g$.

If we use the risk (2), uniformly on $\mathbb{H}_d(\alpha, L)$ the rate of convergence of estimators cannot
be asymptotically better than

$$\psi_{\varepsilon,d}(\alpha) = \begin{cases} \varepsilon^{2\alpha/(2\alpha+d)}, & p \in [1, \infty), \\ \big(\varepsilon\sqrt{\ln(1/\varepsilon)}\big)^{2\alpha/(2\alpha+d)}, & p = \infty, \end{cases} \tag{3}$$

[cf. Ibragimov and Khasminskii (1982), Stone (1982), Nussbaum (1987), Bertin (2004)].
This is the minimax rate on $\mathbb{H}_d(\alpha, L)$: in fact, it can be achieved by a kernel estimator
with properly chosen bandwidth and kernel. More general results on asymptotics of the
minimax risks in estimation of multivariate functions can be found in Kerkyacharian, Lepski
and Picard (2001) and Bertin (2004). It is clear that if $\alpha$ is fixed then even for moderate $d$
the estimation accuracy is very poor unless the noise level $\varepsilon$ is unreasonably small.
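To make the effect of the dimension concrete, the following minimal sketch evaluates the Hölder-ball rate stated above for several values of $d$. The function name and the sample values of $\varepsilon$ and $\alpha$ are illustrative choices, not part of the paper.

```python
import math

def holder_minimax_rate(eps: float, alpha: float, d: int, sup_norm: bool = False) -> float:
    """Rate eps^(2a/(2a+d)) for p < infinity.

    For p = infinity (sup_norm=True) eps is replaced by eps * sqrt(ln(1/eps)).
    """
    if not (0.0 < eps < 1.0):
        raise ValueError("noise level must lie in (0, 1)")
    base = eps * math.sqrt(math.log(1.0 / eps)) if sup_norm else eps
    return base ** (2.0 * alpha / (2.0 * alpha + d))

if __name__ == "__main__":
    # Illustrative values only: smoothness alpha = 2, noise level eps = 0.01.
    for d in (1, 2, 5, 10):
        print(d, holder_minimax_rate(1e-2, alpha=2.0, d=d))
```

For fixed $\alpha$ the exponent $2\alpha/(2\alpha+d)$ decreases to zero as $d$ grows, so the printed values approach 1 and the accuracy degrades.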

This problem arises because the $d$-dimensional Hölder ball $\mathbb{H}_d(\alpha, L)$ is too massive. A
way to overcome the curse of dimensionality is to consider models with smaller functional
classes $\mathbb{G}_s$. Clearly, if the class of candidate functions $F$ is smaller, the rate of convergence
of estimators is faster. Note that the "poverty" of a functional class can be described in
terms of restrictions on its metric entropy. There are nevertheless several ways to do it.

1.2 Structural adaptation 

In this paper we will follow the modeling strategy which consists in imposing additional
structural assumptions on the function to be estimated. This approach was pioneered by
Stone (1985) who discussed the trade-off between flexibility and dimensionality of
nonparametric models and formulated the heuristic dimensionality reduction principle. The main
idea is to assume that even though $F$ is a $d$-dimensional function, it has a simple structure
such that $F$ is effectively $m$-dimensional with $m < d$. The standard examples of structural
nonparametric models are the following.

(i) [Single-index model.] Let $e$ be a direction vector in $\mathbb{R}^d$, and assume that
$F(x) = f(e^T x)$ for some unknown univariate function $f$.

(ii) [Additive model.] Assume that $F(x) = \sum_{i=1}^{d} f_i(x_i)$, where $f_i$ are unknown univariate
functions.

(iii) [Projection pursuit regression.] Let $e_1, \ldots, e_d$ be direction vectors in $\mathbb{R}^d$, and assume
that $F(x) = \sum_{i=1}^{d} f_i(e_i^T x)$, where $f_i$ are as in (ii).

(iv) [Multi-index model.] Let $e_1, \ldots, e_m$, $m < d$, be direction vectors and assume that
$F(x) = f(e_1^T x, \ldots, e_m^T x)$ for some unknown $m$-dimensional function $f$.

In the first three examples the function F is effectively one-dimensional, while in the fourth 
one it is m-dimensional. The heuristic dimensionality reduction principle by Stone (1985) 
suggests that the optimal rate of convergence attainable in structural nonparametric models 
should correspond to the effective dimensionality of F. 
Let us make the following important remark. 

The estimation problem in the models of types (i), (iii) and (iv) can be viewed as 
the problem of adaptation to unknown structure (structural adaptation). Indeed, if the 
direction vectors are given then, after a linear transformation, the problem is reduced either 
to the estimation problem in the additive model (cases (i) and (iii)) or to the estimation of 
an m-variate function. This explains the form of minimax rate of convergence. The main 
problem however is to find an estimator that adjusts automatically to unknown direction 
vectors. 

This remark allows us to state the problem of structural adaptation in the following rather
general way.

1.3 $L_p$-norm oracle inequalities

Suppose that we are given a collection of estimators $\{\hat F_\theta, \theta \in \Theta \subset \mathbb{R}^m\}$ based on the
observation $\{Y(t), t \in \mathcal{D}\}$. In the previous examples the parameter $\theta$ could be, for instance,
the unknown matrix $E = (e_1, \ldots, e_d)$ of the direction vectors, $\theta = E$, and $\hat F_\theta$ could be
a kernel estimator constructed under the hypothesis that $E$ and the smoothness of the functional
components are known (a kernel estimator with fixed bandwidth).

With each estimator $\hat F_\theta$ and unknown function $F$ we associate the risk $\mathcal{R}_p[\hat F_\theta; F]$. The
problem is to construct an estimator, say, $\hat F_*$ such that for all $F$ obeying given smoothness
conditions one has

$$\mathcal{R}_p[\hat F_*; F] \le C \inf_{\theta \in \Theta} \mathcal{R}_p[\hat F_\theta; F], \tag{4}$$

where $C$ is an absolute constant independent of $F$ and $\varepsilon$. Following the modern statistical
terminology we will call the inequality (4) the $L_p$-norm oracle inequality.

Returning to our example with $\theta = E$ we observe that, once established, the $L_p$-norm
oracle inequality leads immediately to the minimax result for any given value of the smoothness
parameter $(\alpha, L)$. In particular, we can state that the estimator $\hat F_*$ is adaptive with respect
to unknown structure.

It is important to realize that the same strategy allows one to avoid dependence of estimation
procedures on smoothness. To this end it is sufficient

• to consider $\theta = (E, \alpha, L)$, which leads to the collection of kernel estimators with
non-fixed bandwidth and orientation;

• to propose an estimator $\hat F_*$ based on this collection;

• to establish for this estimator the $L_p$-norm oracle inequality (4) for any $F \in L_2(\mathcal{D})$
(or on a slightly smaller functional space).

Once realized, this program leads to an estimator that is adaptive with respect to unknown
structure and unknown smoothness properties. It is important to note that such methods
allow one to estimate multivariate functions with high accuracy without sacrificing flexibility
of modeling.

1.4 Objective of the paper 

The goal of the present paper is at least two-fold. 

First we introduce and study a general structural model that we call the additive
multi-index model; it includes models (i)-(iv) as special cases. This generalization is dictated
by the following reasons. On the one hand, structural assumptions allow one to improve the
quality of statistical analysis. On the other hand, they can lead to inadequate modeling.
Thus we seek a general structural model that still allows a gain in estimation accuracy. To
our knowledge the additive multi-index model did not previously appear in the statistical
literature. For this model we propose an estimation procedure that adapts simultaneously
to unknown structure and smoothness of the underlying function. The adaptive results are
obtained for $L_\infty$-losses and for a scale of Hölder type functional classes.

To study this model we proceed as follows. We state the problem of structural
adaptation as the problem of selection from a given collection of estimators. For a collection of
linear estimators satisfying rather mild assumptions we propose a novel general selection
rule and establish for it the $L_p$-norm oracle inequality (4). Similar ideas were used in
Lepski and Levit (1999), Kerkyacharian, Lepski and Picard (2001), Iouditski et al. (2006) for
pointwise adaptation. However we emphasize that our work is the first where the $L_p$-norm
oracle inequality is derived directly without applying pointwise estimation results. It is
precisely this fact that allows us to obtain adaptive results for arbitrary $L_p$-losses. The
selection rule as well as the $L_p$-norm oracle inequality are not related to any specific model,
and they are applicable in a variety of setups where linear estimators are appropriate. We
apply these general results to a specific collection of kernel estimators corresponding to the
additive multi-index model.
1.5 Connection to other works



Structural models. The heuristic dimensionality reduction principle was proved by Stone
(1985) for the additive model (ii), and by Chen (1991) and Golubev (1992) for the
projection pursuit regression model (iii) that includes as a particular case the single-index
model (i). In particular, it was shown there that in these models the asymptotics of the
risk (2) with $p = 2$ and with $\mathbb{G}_s$, $s = (\alpha, L)$, where $\mathbb{G}_s$ is either a Hölder or a Sobolev ball, is
given by $\psi_{\varepsilon,1}(\alpha)$. As we see, the accuracy of estimation in such models corresponds to the
one-dimensional rate ($d = 1$).

Further results and references on estimation in models (i)-(iv) can be found, e.g., in
Nicoleris and Yatracos (1997), Györfi et al. (2002, Chapter 22), and Ibragimov (2004). Let
us briefly discuss the results obtained.

• The estimators providing the rate mentioned above depend heavily on the use of
$L_2$-losses ($p = 2$) in the risk definition. As a consequence, the proposed constructions
cannot be used for any other types of loss functions.

• Except for the paper by Golubev (1992), where an estimator independent of the
parameter $s = (\alpha, L)$ was proposed for the model (i), all other estimators depend
explicitly on prior information on the smoothness of the underlying function.

• As far as we know, not even minimax results have been obtained for the model (iv). One
can guess that the asymptotics of the risk (2) is given by $\psi_{\varepsilon,m}(\alpha)$, which is much better
than the $d$-dimensional rate $\psi_{\varepsilon,d}(\alpha)$ since $m < d$.

It is also worth mentioning that there is a vast literature on estimation of the vectors $e_i$, when
$f_i$ are treated as nonparametric nuisance parameters; see, e.g., Huber (1985), Hall (1989),
Hristache et al. (2001a, 2001b) and references therein.

Oracle approach. To understand the place of the oracle approach within the theory of 
nonparametric estimation let us quote Johnstone (1998): 

"Oracle inequalities are neither the beginning nor the end of a theory, but when 
available, are informative tools." 

Indeed, oracle inequalities are very powerful tools for deriving minimax and minimax
adaptive results. The aim of the oracle approach can be formulated as follows: given a collection
of different estimators based on available data, select the best estimator from the family
(model selection) [see, e.g., Barron, Birgé and Massart (1999)], or find the best convex/linear
combination of the estimators from the family (convex/linear aggregation) [see Nemirovski
(2000), Tsybakov (2003)]. The formal definition of the oracle requires specification of the
collection of estimators and the criterion of optimality.

The majority of oracle procedures described in the literature use the $L_2$-risk as the
criterion of optimality. The following methods can be cited in this context: penalized
likelihood estimators, unbiased risk estimators, blockwise Stein estimators, risk hull estimators
and so on [see Barron, Birgé and Massart (1999), Cavalier et al. (2002), Golubev (2004) and
references therein]. The most general results in the framework of $L_2$-risk aggregation theory
were obtained by Nemirovski (2000) who showed how to aggregate arbitrary estimators.

Other oracle procedures were developed in the context of pointwise estimation; see, e.g.,
Lepski, Spokoiny (1997), Goldenshluger and Nemirovski (1997), Belomestny and Spokoiny
(2004) for the univariate case, and Lepski and Levit (1999), Kerkyacharian, Lepski and
Picard (2001) for the multivariate case. Moreover, Lepski, Mammen and Spokoiny (1997)
and Kerkyacharian, Lepski and Picard (2001) show how to derive $L_p$-norm oracle
inequalities from pointwise oracle inequalities. Although these $L_p$-norm oracle inequalities allow one to
derive minimax results on rather complicated functional spaces, they do not lead to sharp
adaptive results.

Finally we mention the $L_1$-norm oracle approach developed by Devroye and Lugosi
(2001) in the context of density estimation.

The rest of the paper is organized as follows. In Section 2 we present our general
selection rule and establish the key oracle inequality. Section 3 is devoted to adaptive
estimation in the additive multi-index model. The proofs of the main results are given in
Section 4. Auxiliary results are postponed to the Appendix.



2 General selection rule 
2.1 Preliminaries 

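In what follows $\|\cdot\|_p$ stands for the $L_p(\mathcal{D})$-norm, while $\|\cdot\|_{p,q}$ denotes the $L_{p,q}(\mathcal{D} \times \mathcal{D}_0)$-norm:

$$\|G\|_{p,q} = \Big( \int \Big( \int |G(t,x)|^p \, dt \Big)^{q/p} dx \Big)^{1/q}, \quad p, q \in [1, \infty].$$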

We write also | • | for the Euclidean norm, and it will be always clear from the context which 
Euclidean space is meant. 

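Let $\Theta \subset \mathbb{R}^m$. Assume that we are given a parameterized family of kernels
$\mathcal{K} = \{K_\theta(\cdot,\cdot), \theta \in \Theta\}$, where $K_\theta : \mathcal{D} \times \mathcal{D}_0 \to \mathbb{R}$. Consider the collection of linear estimators
of $F$ associated with the family $\mathcal{K}$:

$$\mathcal{F}(\mathcal{K}) = \Big\{ \hat F_\theta(x) = \int K_\theta(t,x)\, Y(dt), \ \theta \in \Theta \Big\}.$$

Our goal is to propose a measurable choice from the collection $\{\hat F_\theta, \theta \in \Theta\}$ such that the
risk of the selected estimator will be as close as possible to $\inf_{\theta \in \Theta} \mathcal{R}_p[\hat F_\theta; F]$.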
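Let

$$B_\theta(x) := \int K_\theta(t,x) F(t)\, dt - F(x), \qquad Z_\theta(x) := \int K_\theta(t,x)\, W(dt); \tag{5}$$

then $\hat F_\theta(x) - F(x) = B_\theta(x) + \varepsilon Z_\theta(x)$, so that $B_\theta(\cdot)$ and $\varepsilon Z_\theta(\cdot)$ are the bias and the stochastic
error of the estimator $\hat F_\theta$ respectively. We assume that the family $\mathcal{K}$ of kernels satisfies the
following conditions.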

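(K0) For every $x \in \mathcal{D}_0$ and $\theta \in \Theta$ the support of $K_\theta(\cdot, x)$ belongs to $\mathcal{D}$,

$$\int K_\theta(t,x)\, dt = 1, \quad \forall (x, \theta) \in \mathcal{D}_0 \times \Theta, \tag{6}$$

$$\sigma(\mathcal{K}) := \sup_{\theta \in \Theta} \|K_\theta\|_{2,\infty} < \infty, \tag{7}$$

$$M(\mathcal{K}) := \sup_{\theta \in \Theta} \Big\{ \sup_x \|K_\theta(\cdot, x)\|_1 \vee \sup_t \|K_\theta(t, \cdot)\|_1 \Big\} < \infty. \tag{8}$$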

Remark 1 Conditions (6) and (7) are absolutely standard in the context of kernel
estimation, and only condition (8) has to be discussed. First we note that (8) is rather mild. In
particular, if the collection $\mathcal{K}$ contains positive kernels then $M(\mathcal{K}) = 1$. Moreover $M(\mathcal{K})$ will
appear in the expression of the constant $C$ in the $L_p$-norm oracle inequality (4).

(K1) For any $\theta, \nu \in \Theta$

$$\int K_\theta(t,y) K_\nu(y,x)\, dy = \int K_\theta(y,x) K_\nu(t,y)\, dy, \quad \forall (x,t) \in \mathcal{D} \times \mathcal{D}. \tag{9}$$

Remark 2 Assumption K1 is crucial for the construction of our estimation procedure,
and it restricts the collection of kernels to be used. We note nevertheless that property (9)
is trivially fulfilled for convolution kernels $K_\theta(t,x) = K_\theta(t - x)$, which correspond to the
standard kernel estimators.

The next example describes a collection of kernels corresponding to the single-index 
model. 

Example. Let $K : \mathbb{R}^d \to \mathbb{R}$, $\int K(t)\, dt = 1$, and let $E$ be an orthogonal matrix with the first
vector-column equal to $e$. Define for all $h \in \mathbb{R}^d_+$

$$K_h(t) := \Big[ \prod_{i=1}^{d} h_i \Big]^{-1} K\Big( \frac{t_1}{h_1}, \ldots, \frac{t_d}{h_d} \Big).$$



Denote $\mathcal{H} = \{h \in \mathbb{R}^d_+ : h = (h_1, h_{\max}, \ldots, h_{\max}), \ h_1 \in [h_{\min}, h_{\max}]\}$, where the bandwidth
range $[h_{\min}, h_{\max}]$ is supposed to be fixed. The collection of kernels corresponding to the
single-index model is

$$\mathcal{K} = \big\{ K_\theta(t,x) = K_h[E^T(t - x)], \ \theta = (E, h) \in \Theta = \mathcal{E} \times \mathcal{H} \subset \mathbb{R}^{d^2 + d} \big\},$$

where $\mathcal{E}$ is the set of all $d \times d$ orthogonal matrices.

Clearly, $M(\mathcal{K}) = \|K\|_1$ so that K0 is fulfilled if $\|K\|_1 < \infty$. Assumption K1 is trivially
fulfilled because $K_\theta(t,x) = K_\theta(t - x)$.
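The single-index kernel $K_h[E^T(t-x)]$ can be sketched numerically in dimension $d = 2$. The code below is only an illustration under stated assumptions: the base kernel $K$ is taken to be the standard Gaussian product density, $K_h$ is the usual rescaling $[\prod_i h_i]^{-1} K(t_1/h_1, \ldots, t_d/h_d)$, and the function names are hypothetical.

```python
import numpy as np

def rotation_with_first_column(e):
    """Orthogonal 2x2 matrix whose first column is the normalised direction e."""
    e = np.asarray(e, dtype=float)
    e = e / np.linalg.norm(e)
    return np.array([[e[0], -e[1]],
                     [e[1],  e[0]]])

def single_index_kernel(t, x, E, h):
    """Evaluate K_h[E^T (t - x)] for a Gaussian product kernel K (an assumption).

    t has shape (N, d) or (d,), x has shape (d,), h has shape (d,).
    """
    h = np.asarray(h, dtype=float)
    u = (np.asarray(t, dtype=float) - np.asarray(x, dtype=float)) @ E  # rows are E^T (t - x)
    z = u / h                                                          # coordinatewise rescaling
    dens = np.exp(-0.5 * np.sum(z ** 2, axis=-1)) / (2.0 * np.pi) ** (h.size / 2.0)
    return dens / np.prod(h)

if __name__ == "__main__":
    E = rotation_with_first_column([1.0, 1.0])
    h = np.array([0.05, 0.2])          # narrow along e, wide in the orthogonal direction
    grid = np.linspace(-1.5, 1.5, 301)
    T = np.stack(np.meshgrid(grid, grid, indexing="ij"), axis=-1).reshape(-1, 2)
    step = grid[1] - grid[0]
    total = single_index_kernel(T, np.zeros(2), E, h).sum() * step ** 2
    print("Riemann sum of the kernel over the grid:", total)
```

The kernel depends on $(t, x)$ only through $t - x$, which is the convolution structure used above to verify Assumption K1.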

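For $\theta, \nu \in \Theta$ we define

$$K_{\theta,\nu}(t,x) := \int K_\theta(t,y) K_\nu(y,x)\, dy, \tag{10}$$

and let

$$\hat F_{\theta,\nu}(x) := \int K_{\theta,\nu}(t,x)\, Y(dt), \quad x \in \mathcal{D}.$$

Observe that $K_{\theta,\nu} = K_{\nu,\theta}$ in view of (9), so that indeed $\hat F_{\theta,\nu} = \hat F_{\nu,\theta}$. This property is
heavily exploited in the sequel, since the statistic $\hat F_{\theta,\nu}$ is an auxiliary estimator used in our
construction. We have

$$\hat F_{\theta,\nu}(x) - F(x) = \int K_{\theta,\nu}(t,x) F(t)\, dt - F(x) + \varepsilon \int K_{\theta,\nu}(t,x)\, W(dt) =: B_{\theta,\nu}(x) + \varepsilon Z_{\theta,\nu}(x). \tag{11}$$

The next simple result is a basic tool for construction of our selection procedure.

Lemma 1 Let Assumption K0 hold; then for any $F \in L_2(\mathcal{D}) \cap L_p(\mathcal{D})$

$$\sup_{\nu \in \Theta} \|B_{\theta,\nu} - B_\nu\|_p \le M(\mathcal{K}) \|B_\theta\|_p, \quad \forall \theta \in \Theta. \tag{12}$$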
Proof: By definition of $B_{\theta,\nu}$, $B_\nu$ and by the Fubini theorem

$$\begin{aligned}
B_{\theta,\nu}(x) - B_\nu(x) &= \int K_{\theta,\nu}(t,x) F(t)\, dt - \int K_\nu(t,x) F(t)\, dt \\
&= \int K_\nu(y,x) \Big[ \int K_\theta(t,y) F(t)\, dt - F(y) \Big] dy \\
&= \int K_\nu(y,x) B_\theta(y)\, dy.
\end{aligned}$$

The statement of the lemma follows from the general theorem about boundedness of integral
operators on $L_p$-spaces [see, e.g., Folland (1999, Theorem 6.18)] and (8). ∎



2.2 Selection rule 

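In order to present the basic idea underlying construction of the selection rule we first
discuss the noise-free version ($\varepsilon = 0$) of the estimation problem.

Idea of construction (ideal case $\varepsilon = 0$). In this situation

$$\mathcal{F}(\mathcal{K}) = \Big\{ \hat F_\theta(\cdot) = \int K_\theta(t, \cdot) F(t)\, dt, \ \theta \in \Theta \Big\},$$

so that $\hat F_\theta$ can be viewed as a kernel-type approximation (smoother) of $F$. Note that the
risk $\mathcal{R}_p[\hat F_\theta; F] = \|\hat F_\theta - F\|_p = \|B_\theta\|_p$ represents the quality of approximation. Let $\hat F_{\theta_*}$ be a
smoother from $\mathcal{F}(\mathcal{K})$ with the minimal approximation error, i.e.

$$\theta_* = \arg\inf_{\theta \in \Theta} \mathcal{R}_p[\hat F_\theta; F].$$

Suppose that $\mathcal{K}$ satisfies Assumptions K0 and K1. Based on this collection we want to
select a smoother, say $\hat F_{\hat\theta} \in \mathcal{F}(\mathcal{K})$, that is "as good as" $\hat F_{\theta_*}$, i.e., a smoother satisfying the
$L_p$-oracle inequality (4).

To select $\hat\theta$ we suggest the following rule

$$\hat\theta = \arg\inf_{\theta \in \Theta} \Big\{ \sup_{\nu \in \Theta} \|\hat F_{\theta,\nu} - \hat F_\nu\|_p \Big\}.$$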

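Let us compute the approximation error of the selected smoother $\hat F_{\hat\theta}$. By the triangle
inequality

$$\begin{aligned}
\|B_{\hat\theta}\|_p = \|\hat F_{\hat\theta} - F\|_p &\le \|\hat F_{\hat\theta} - \hat F_{\hat\theta,\theta_*}\|_p + \|\hat F_{\hat\theta,\theta_*} - \hat F_{\theta_*}\|_p + \|\hat F_{\theta_*} - F\|_p \\
&= \|B_{\hat\theta} - B_{\hat\theta,\theta_*}\|_p + \|B_{\hat\theta,\theta_*} - B_{\theta_*}\|_p + \|B_{\theta_*}\|_p.
\end{aligned} \tag{13}$$

In view of Assumption K1 and (12) the first term on the right hand side of (13) does not
exceed $M(\mathcal{K}) \|B_{\theta_*}\|_p$. To bound the second term we use the definition of $\hat\theta$ and (12):

$$\|B_{\hat\theta,\theta_*} - B_{\theta_*}\|_p \le \sup_{\nu \in \Theta} \|B_{\hat\theta,\nu} - B_\nu\|_p \le \sup_{\nu \in \Theta} \|B_{\theta_*,\nu} - B_\nu\|_p \le M(\mathcal{K}) \|B_{\theta_*}\|_p.$$

Combining these bounds we obtain from (13) that

$$\mathcal{R}_p[\hat F_{\hat\theta}; F] \le (2M(\mathcal{K}) + 1) \|B_{\theta_*}\|_p = (2M(\mathcal{K}) + 1) \inf_{\theta \in \Theta} \mathcal{R}_p[\hat F_\theta; F].$$

Therefore in the ideal situation $\varepsilon = 0$, the $L_p$-oracle inequality holds with $C = 2M(\mathcal{K}) + 1$.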
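The noise-free argument can be checked numerically on a discretized toy problem. The sketch below is not the paper's construction: it assumes a one-dimensional periodic grid, Gaussian convolution weights indexed by a bandwidth (so that K1 holds and $M(\mathcal{K}) = 1$), and circular convolution via the FFT; all function names are hypothetical.

```python
import numpy as np

def gaussian_weights(n, h):
    """Positive convolution weights on an n-point periodic grid of [0, 1), summing to one."""
    x = np.arange(n) / n
    dist = np.minimum(x, 1.0 - x)              # circular distance to the origin
    w = np.exp(-0.5 * (dist / h) ** 2)
    return w / w.sum()

def smooth(w, g):
    """Circular convolution of the weights w with the vector g."""
    return np.real(np.fft.ifft(np.fft.fft(w) * np.fft.fft(g)))

def lp_norm(g, p):
    """Discrete L_p norm with respect to the uniform measure on the grid."""
    if np.isinf(p):
        return float(np.max(np.abs(g)))
    return float(np.mean(np.abs(g) ** p) ** (1.0 / p))

def select_noise_free(f, bandwidths, p):
    """Return the selected index and the approximation errors ||F_theta - F||_p."""
    weights = [gaussian_weights(f.size, h) for h in bandwidths]
    smoothers = [smooth(w, f) for w in weights]             # F_theta = K_theta * F
    criterion = [
        max(lp_norm(smooth(w_t, s_v) - s_v, p) for s_v in smoothers)  # sup_nu ||F_{theta,nu} - F_nu||_p
        for w_t in weights
    ]
    errors = [lp_norm(s - f, p) for s in smoothers]
    return int(np.argmin(criterion)), errors

if __name__ == "__main__":
    x = np.arange(256) / 256
    f = np.sin(2 * np.pi * x) + 0.5 * np.abs(x - 0.5)
    idx, errors = select_noise_free(f, [0.005, 0.02, 0.08], p=2)
    # With M(K) = 1 the selected error should not exceed 3 times the best error.
    print(errors[idx], 3 * min(errors))
```

In this discrete periodic setting the convolution operators commute and have $L_p$ operator norm at most one, so the three-term bound above applies with constant $2M(\mathcal{K}) + 1 = 3$.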






Example (continuation). We suppose additionally that there exists a positive integer $l$
such that

$$\int t^k K(t)\, dt = 0, \quad |k| = 1, \ldots, l,$$

where $k = (k_1, \ldots, k_d)$ is the multi-index, $k_i \ge 0$, $|k| = k_1 + \cdots + k_d$, $t^k = t_1^{k_1} \cdots t_d^{k_d}$ for
$t = (t_1, \ldots, t_d)$. Let $e$ be the true direction vector in the model (i). After the rotation described
by the matrix $E$, for any $h \in \mathcal{H}$ we have

$$\|B_{E,h}\|_p \le \Big\| \int K(u) \big[ f(\cdot + h_1 u_1) - f(\cdot) \big]\, du \Big\|_p.$$

If there exist $0 < \alpha < l + 1$, $L > 0$ such that $f \in \mathbb{H}_1(\alpha, L)$ then

$$\|B_{E,h}\|_p \le L h_1^{\alpha}, \quad \forall h_1 \in [h_{\min}, h_{\max}]. \tag{14}$$

It is evident that when there is no noise in the model, the best choice of $h_1$ is $h_{\min}$.



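Idea of construction (real case $\varepsilon > 0$). When the noise is present, we use the same
selection procedure with additional control of the noise contribution by its maximal value.
Similarly to the ideal case our selection rule is based on the statistics
$\{\sup_{\nu \in \Theta} \|\hat F_{\theta,\nu} - \hat F_\nu\|_p, \ \theta \in \Theta\}$. Note that

$$\begin{aligned}
\|\hat F_{\theta,\nu} - \hat F_\nu\|_p &\le \|B_{\theta,\nu} - B_\nu\|_p + \varepsilon \|Z_{\theta,\nu} - Z_\nu\|_p \\
&\le \|B_{\theta,\nu} - B_\nu\|_p + \varepsilon \sup_x \tilde\sigma_{\theta,\nu}(x) \, \sup_{\theta,\nu} \|\tilde Z_{\theta,\nu}\|_p,
\end{aligned} \tag{15}$$

where $Z_{\theta,\nu}(\cdot)$ and $Z_\nu(\cdot)$ are given in (11) and (5) respectively, and

$$\tilde Z_{\theta,\nu}(x) := \tilde\sigma_{\theta,\nu}^{-1}(x) \big[ Z_{\theta,\nu}(x) - Z_\nu(x) \big], \tag{16}$$

$$\sigma^2_{\theta,\nu}(x) := E|Z_{\theta,\nu}(x) - Z_\nu(x)|^2 = \|K_{\theta,\nu}(\cdot, x) - K_\nu(\cdot, x)\|_2^2, \quad x \in \mathcal{D}_0, \qquad \tilde\sigma_{\theta,\nu}(x) := \max\{\sigma_{\theta,\nu}(x), 1\}. \tag{17}$$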


Remark 3 In what follows we will be interested in large deviation probability for the
maximum of the process $Z_{\theta,\nu}(x) - Z_\nu(x)$. Typically the variance $\sigma_{\theta,\nu}(x)$ of this process tends to
infinity as $\varepsilon \to 0$; therefore in the most interesting examples $\tilde\sigma_{\theta,\nu}(x) = \sigma_{\theta,\nu}(x)$, and $\tilde Z_{\theta,\nu}(x)$
has unit variance. However, for an abstract collection of kernels, it can happen that
$\sigma_{\theta,\nu}(x)$ is very small, for example, if $K_\theta$ approaches the delta-function. That is why we
truncate the variance from below by 1.

In the ideal case we deduced from (12) that

$$[M(\mathcal{K})]^{-1} \sup_{\nu \in \Theta} \|\hat F_{\theta,\nu} - \hat F_\nu\|_p \le \|B_\theta\|_p, \quad \forall \theta \in \Theta, \tag{18}$$

i.e., the left hand side can be considered as a lower estimator of the bias. In the case of
$\varepsilon > 0$ we would like to guarantee the same property with high probability.

This leads to the following control of the stochastic term. Let $\delta \in (0,1)$, and let
$\varkappa_p = \varkappa_p(\mathcal{K}, \delta)$ be the minimal positive real number such that

$$P\Big\{ \sup_{\theta \in \Theta} \|\tilde Z_\theta(\cdot)\|_p \ge \varkappa_p \Big\} + P\Big\{ \sup_{(\theta,\nu) \in \Theta \times \Theta} \|\tilde Z_{\theta,\nu}(\cdot)\|_p \ge \varkappa_p \Big\} \le \delta, \tag{19}$$

where similarly to (16) and (17) we set

$$\tilde Z_\theta(x) := \sigma_\theta^{-1}(x) Z_\theta(x),$$

$$\sigma_\theta^2(x) := E|Z_\theta(x)|^2 = \|K_\theta(\cdot, x)\|_2^2.$$

The constant $\varkappa_p$ controls the deviation of $\|\tilde Z_{\theta,\nu}\|_p$ as well as the deviation of the standardized
stochastic terms of all estimators from the collection $\mathcal{F}(\mathcal{K})$. We immediately obtain from
(12), (15) and (19) that

$$\hat B_\theta(p) := [M(\mathcal{K})]^{-1} \sup_{\nu \in \Theta} \Big[ \|\hat F_{\theta,\nu} - \hat F_\nu\|_p - \varepsilon \varkappa_p \sup_x \tilde\sigma_{\theta,\nu}(x) \Big] \le \|B_\theta\|_p, \quad \forall \theta \in \Theta, \tag{20}$$

with probability larger than $1 - \delta$.

Thus, similarly to (18), $\hat B_\theta(p)$ is a lower estimator of the $L_p$-norm of the bias of the
estimator $\hat F_\theta$. This leads us to the following selection procedure.

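Selection rule. Define

$$\hat\theta = \hat\theta(\delta) := \arg\inf_{\theta \in \Theta} \Big\{ \hat B_\theta(p) + \varkappa_p(\mathcal{K}, \delta)\, \varepsilon \sup_x \sigma_\theta(x) \Big\}, \tag{21}$$

and put finally

$$\hat F(\delta) = \hat F_{\hat\theta}.$$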

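Remark 4 The choice of $\hat\theta$ is very natural. Indeed, in view of (20) for any $\theta \in \Theta$ with
high probability

$$\hat B_\theta(p) + \varkappa_p \varepsilon \sup_x \sigma_\theta(x) \le \|B_\theta\|_p + \varkappa_p \varepsilon \sup_x \sigma_\theta(x).$$

On the other hand, under rather general assumptions (see Section 2.4)

$$\|B_\theta\|_p + \varepsilon \varkappa_p \sup_x \sigma_\theta(x) \le C \mathcal{R}_p[\hat F_\theta; F],$$

where $C$ is an absolute constant, independent of $F$ and $\varepsilon$. Therefore with high probability

$$\hat B_{\hat\theta}(p) + \varkappa_p \varepsilon \sup_x \sigma_{\hat\theta}(x) \le C \inf_{\theta \in \Theta} \mathcal{R}_p[\hat F_\theta; F].$$

Thus in order to establish the $L_p$-norm oracle inequality it suffices to majorate the risk of
the estimator $\hat F_{\hat\theta}$ by $\hat B_{\hat\theta}(p) + \varkappa_p \varepsilon \sup_x \sigma_{\hat\theta}(x)$ and to choose $\delta = \delta(\varepsilon)$ tending to zero at an
appropriate rate.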
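The selection rule can be sketched on a discretized one-dimensional toy model with $p = \infty$. Everything below is an assumption made for illustration: the periodic grid, the Gaussian convolution weights, the discretization $y_i = F(x_i) + \varepsilon\sqrt{n}\,\xi_i$ of the white noise model, and in particular the value passed as `kappa`, which is a hypothetical placeholder and not the quantile $\varkappa_p(\mathcal{K}, \delta)$ defined through the deviation probabilities above.

```python
import numpy as np

def periodic_gaussian(n, h):
    """Gaussian weights on an n-point periodic grid of [0, 1), summing to one."""
    x = np.arange(n) / n
    dist = np.minimum(x, 1.0 - x)
    w = np.exp(-0.5 * (dist / h) ** 2)
    return w / w.sum()

def conv(w, g):
    """Circular convolution of the weights w with the vector g."""
    return np.real(np.fft.ifft(np.fft.fft(w) * np.fft.fft(g)))

def sup_norm(g):
    return float(np.max(np.abs(g)))

def select(y, bandwidths, eps, kappa):
    """Index minimising B_hat_theta + kappa * eps * sup_x sigma_theta(x), for p = infinity."""
    n = y.size
    ws = [periodic_gaussian(n, h) for h in bandwidths]
    est = [conv(w, y) for w in ws]                         # F_hat_theta
    sigma = [np.sqrt(n * np.sum(w ** 2)) for w in ws]      # discrete analogue of ||K_theta||_2
    m_k = 1.0                                              # M(K) = 1 for positive weights
    scores = []
    for w_t, s_t in zip(ws, sigma):
        terms = []
        for w_v, f_v in zip(ws, est):
            w_tv = conv(w_t, w_v)                          # weights of K_{theta,nu}
            s_tv = max(np.sqrt(n * np.sum((w_tv - w_v) ** 2)), 1.0)  # truncated below by 1
            terms.append(sup_norm(conv(w_t, f_v) - f_v) - eps * kappa * s_tv)
        b_hat = max(terms) / m_k                           # lower estimator of the bias norm
        scores.append(b_hat + kappa * eps * s_t)
    return int(np.argmin(scores)), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, eps = 512, 0.05
    x = np.arange(n) / n
    f = np.sin(2 * np.pi * x)
    y = f + eps * np.sqrt(n) * rng.standard_normal(n)      # discretised white-noise observation
    bandwidths = [0.005, 0.02, 0.08]
    # kappa below is a placeholder threshold chosen for the demo only.
    idx, scores = select(y, bandwidths, eps, kappa=np.sqrt(2 * np.log(n)))
    print("selected bandwidth:", bandwidths[idx])
```

Because convolution kernels have variance that does not depend on $x$, the suprema over $x$ reduce to single numbers here; for a general family $\mathcal{K}$ they would have to be computed pointwise.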

2.3 Basic result 

The next theorem establishes the basic result of this paper. 

Theorem 1 Let Assumptions K0 and K1 hold, and suppose that

(I) $\hat\theta$ defined in (21) is measurable with respect to the observation $\{Y(t), t \in \mathcal{D}\}$, and $\hat\theta$
belongs to $\Theta$;

(II) the events in (19) belong to the $\sigma$-algebra generated by the observation $\{Y(t), t \in \mathcal{D}\}$.

Let $\delta \in (0, 1)$, $\varkappa_p$ be defined in (19), and $F$ be such that (I) and (II) hold. Then

$$\mathbb{E}_F \|\hat F(\delta) - F\|_p \le [3 + 2M(\mathcal{K})] \inf_{\theta \in \Theta} \Big\{ \|B_\theta\|_p + \varkappa_p \varepsilon \big[ \sup_x \sigma_\theta(x) \big] \Big\} + r(\delta), \tag{22}$$

where

$$r(\delta) := \|F\|_\infty [1 + M(\mathcal{K})]\, \delta + \sigma(\mathcal{K})\, \delta^{1/2} \big[ E|\zeta|^2 \big]^{1/2},$$

$\sigma(\mathcal{K})$ is defined in (7), $\zeta := \sup_{x,\theta} |Z_\theta(x)|$, and $E$ denotes expectation with respect to the
Wiener measure.

Remark 5 In order to verify measurability of $\hat\theta$ and the condition (II) we need to impose
additional assumptions on the collection of kernels $\mathcal{K}$. These assumptions should guarantee
smoothness properties of the sample paths of the Gaussian processes $\{Z_\theta(x), (x, \theta) \in \mathcal{D}_0 \times \Theta\}$
and $\{Z_{\theta,\nu}(x), (x, \theta, \nu) \in \mathcal{D}_0 \times \Theta \times \Theta\}$. It is well known [see, e.g., Lifshits (1995)] that such
properties for Gaussian processes can be described in terms of their covariance structures.
In our particular case, the covariance structure is entirely determined by the collection of
kernels $\mathcal{K}$. These fairly general conditions on $\mathcal{K}$ are given in Section 2.4.

To ensure that $\hat\theta \in \Theta$ we need not only smoothness conditions on the stochastic processes
involved in the procedure description, but also conditions on the smoothness of $F$. It is sufficient
to suppose that $F$ belongs to some isotropic Hölder ball, and this will always be assumed
in the sequel. This hypothesis also guarantees that $F$ is uniformly bounded, which, in
turn, implies boundedness of the remainder term $r(\delta)$. It is important to note that neither
the procedure nor the inequality (22) depends on the parameters of this ball.

Remark 6 Our procedure and the basic oracle inequality depend on the design parameter
$\delta$. The choice of this parameter is a delicate problem. On the one hand, in order to reduce
the remainder term we should choose $\delta$ as small as possible. On the other hand, in view of
the definition, $\varkappa_p = \varkappa_p(\delta) \to \infty$ as $\delta \to 0$. Note that we cannot minimize the right hand
side of (22) with respect to $\delta$ because this leads to $\delta$ depending on the unknown function $F$.
Fortunately, the same assumptions from Section 2.4 ensure that up to an absolute constant

$$\inf_{\theta \in \Theta} \Big\{ \|B_\theta\|_p + \varkappa_p \varepsilon \big[ \sup_x \sigma_\theta(x) \big] \Big\} \ge \varepsilon. \tag{23}$$

The form of the remainder term $r(\delta)$ together with (23) suggests that $\delta$ should depend on
$\varepsilon$, for example, $\delta = \delta(\varepsilon) = \varepsilon^a$, $a > 1$. Such a choice under the assumptions from Section 2.4
allows one to show that

$$\varkappa_p(\delta(\varepsilon)) \le \begin{cases} C(p), & p \in [1, \infty), \\ C(\infty) \sqrt{\ln(1/\varepsilon)}, & p = \infty, \end{cases} \tag{24}$$

where $C(p)$, $p \in [1, \infty]$, are absolute constants, independent of $\varepsilon$.

Although the inequality (22) is not stated in the form of the $L_p$-norm oracle inequality,
it can be helpful (in view of (24)) for deriving adaptive minimax results. To demonstrate
this we return to the single-index model.

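Example (continuation). Recall that $\theta = (E, h)$ and note that

$$\sigma_\theta^2(x) = \sigma_{E,h}^2(x) = [h_1 h_{\max}^{d-1}]^{-2} \int K^2\big( E^T(t - x)/h \big)\, dt = [h_1 h_{\max}^{d-1}]^{-1} \|K\|_2^2$$

(with coordinatewise division by $h$) does not depend on $E$ and $x$. Fix $\delta = \varepsilon^a$ and let $\hat F_\varepsilon := \hat F(\varepsilon^a)$ be the corresponding estimator.
Then (22) takes the form

$$\begin{aligned}
\mathbb{E}_F \|\hat F_\varepsilon - F\|_p &\le (3 + 2\|K\|_1) \inf_{E,h} \Big[ \|B_{E,h}\|_p + \varepsilon \varkappa_p(\varepsilon^a) \sup_x \sigma_{E,h}(x) \Big] + O(\varepsilon^a) \\
&\le (3 + 2\|K\|_1) \inf_{h} \Big[ \inf_E \|B_{E,h}\|_p + \varepsilon \varkappa_p(\varepsilon^a) [h_1 h_{\max}^{d-1}]^{-1/2} \|K\|_2 \Big] + O(\varepsilon^a) \\
&\le (3 + 2\|K\|_1) \inf_{h} \Big[ L h_1^{\alpha} + \varepsilon \varkappa_p(\varepsilon^a) [h_1 h_{\max}^{d-1}]^{-1/2} \|K\|_2 \Big] + O(\varepsilon^a).
\end{aligned}$$

The last inequality follows from (14). Taking into account (24), choosing $h_{\max} > 0$
independent of $\varepsilon$, $h_{\min} = \varepsilon^2$, and minimizing the last inequality with respect to $h_1 \in [h_{\min}, h_{\max}]$
we obtain for all $\alpha > 0$, $L > 0$

$$\sup_{f \in \mathbb{H}_1(\alpha, L)} \mathbb{E}_F \|\hat F_\varepsilon - F\|_p \le C_p(L, h_{\max}, K) \begin{cases} \varepsilon^{2\alpha/(2\alpha+1)}, & p \in [1, \infty), \\ \big[ \varepsilon \sqrt{\ln(1/\varepsilon)} \big]^{2\alpha/(2\alpha+1)}, & p = \infty. \end{cases}$$

It remains to note that $\hat F_\varepsilon$ does not depend on $(\alpha, L)$, and attains in view of the last inequality
the minimax rate of convergence for all values of $(\alpha, L)$ simultaneously. It means that $\hat F_\varepsilon$
is optimally adaptive on the scale of Hölder balls.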
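The bandwidth minimization in this example is a one-dimensional calculus exercise, sketched below. The bound $g(h_1) = L h_1^{\alpha} + \varepsilon \varkappa c\, h_1^{-1/2}$ is taken from the last display, with $c$ standing for $h_{\max}^{-(d-1)/2} \|K\|_2$; the function names are hypothetical, and the constraint $h_1 \in [h_{\min}, h_{\max}]$ is ignored for simplicity.

```python
import numpy as np

def risk_bound(h1, eps, alpha, L, kappa, c):
    """Bias bound L * h1^alpha plus stochastic term eps * kappa * c * h1^(-1/2)."""
    return L * h1 ** alpha + eps * kappa * c * h1 ** -0.5

def balancing_bandwidth(eps, alpha, L, kappa, c):
    """Unconstrained minimiser of risk_bound over h1 > 0.

    Setting the derivative alpha*L*h1^(alpha-1) - (eps*kappa*c/2)*h1^(-3/2) to zero
    gives h1^(alpha + 1/2) = eps*kappa*c / (2*alpha*L).
    """
    return (eps * kappa * c / (2.0 * alpha * L)) ** (2.0 / (2.0 * alpha + 1.0))

if __name__ == "__main__":
    alpha, L, kappa, c = 2.0, 1.0, 1.0, 1.0    # illustrative values only
    for eps in (1e-1, 1e-2, 1e-3):
        h = balancing_bandwidth(eps, alpha, L, kappa, c)
        print(eps, h, risk_bound(h, eps, alpha, L, kappa, c))
```

At the minimizer both terms are proportional to $(\varepsilon\varkappa)^{2\alpha/(2\alpha+1)}$, which is how the one-dimensional rate appears, with the extra $\sqrt{\ln(1/\varepsilon)}$ factor entering through $\varkappa$ when $p = \infty$.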



2.4 Key oracle inequality 

In this section we discuss the choice of $\delta$ which leads to the key oracle inequality. This
inequality is suitable for deriving minimax and minimax adaptive results with minimal
technicalities. In particular, we will use it for adaptive estimation in the additive
multi-index model.

In order to establish the key oracle inequality we need to impose additional conditions
on the collection of kernels $\mathcal{K}$. In particular, these conditions should guarantee the bounds
(24) for $\varkappa_p(\delta(\varepsilon))$. In the case $p = \infty$ such conditions are rather mild and standard; they
are related to deviation of the supremum of Gaussian processes and therefore can be expressed
through smoothness of their covariance functions (Lifshits 1995). As for the case $p < \infty$,
we need to establish bounds on large deviation probabilities of the $L_p$-norm of Gaussian
processes. This requires additional assumptions on the collection of kernels. Moreover,
such bounds cannot be directly obtained from the existing results. We note nevertheless
that (24) for the case $p < \infty$ can be shown under fairly general assumptions, and this will
be the subject of a forthcoming paper. From now on we restrict ourselves to the case
$p = \infty$.

At the end of this section we discuss the connection between the key oracle inequality
and the $L_\infty$-norm oracle inequality of type (4).



Assumptions. We suppose that the set $\Theta$ has the following structure.

(A) $\Theta = \Theta_1 \times \Theta_2$ where $\Theta_1 = \{\theta^1, \ldots, \theta^N\}$ is a finite set, and $\Theta_2 \subset \mathbb{R}^m$ is a compact
subset of $\mathbb{R}^m$ contained in the Euclidean ball of radius $R$. Without loss of generality
we assume that $R > 1$.

Remark 7 Assumption A allows us to consider both discrete and continuous parameter sets.
In particular, the case of empty $\Theta_2$ corresponds to selection from a finite set of estimators.
This setup is often considered within the framework of the oracle approach. In order to
emphasize dependence of the kernels $K_\theta$ on $\theta_1 \in \Theta_1$ and $\theta_2 \in \Theta_2$, we sometimes write $K_{(\theta_1, \theta_2)}$
instead of $K_\theta$.
(B) There exists $M_0 > 0$ such that $F \in \mathbb{H}_d(M_0)$, where

$$\mathbb{H}_d(M_0) = \Big\{ g : \ g \in \bigcup_{\alpha > 0, L > 0} \mathbb{H}_d(\alpha, L), \ \|g\|_\infty \le M_0 \Big\}.$$

Remark 8 Assumption B is necessary for verification of the condition (I) of Theorem 1.
It is also needed for deriving the key oracle inequality from Theorem 1 since it allows us to
bound uniformly the remainder term in (22).

We emphasize that our procedure does not depend on $M_0$. Finally note that $\mathbb{H}_d(M_0)$ is
a huge set of functions (a bit smaller than the space of all bounded continuous functions),
i.e., Assumption B is not restrictive at all.

(K2) Denote $U := \mathcal{D}_0 \times \Theta_2$. There exist positive constants $L$ and $\gamma \in (0, 1]$ such that

$$\sup_{\theta_1\in\Theta_1}\ \sup_{u,u'\in U} \frac{\|K_{(\theta_1,\theta_2)}(\cdot,x) - K_{(\theta_1,\theta_2')}(\cdot,x')\|_2}{|u - u'|^{\gamma}} \le L,$$

where $u = (x, \theta_2)$ and $u' = (x', \theta_2')$. Without loss of generality we assume that $L > 1$.

Remark 9 Assumption K2 ensures that sample paths of the processes $\{Z_\theta(x), (x,\theta) \in \mathcal{D}_0 \times \Theta\}$
and $\{Z_{\theta,\nu}(x), (x,\theta,\nu) \in \mathcal{D}_0 \times \Theta \times \Theta\}$ belong with probability one to the isotropic Hölder
spaces $\mathbb{H}_{m+d}(\tau)$ and $\mathbb{H}_{2m+d}(\tau)$ with regularity index $0 < \tau < \gamma$ (Lifshits 1995, Section 15).
In particular, it is sufficient for fulfillment of conditions (I) and (II) of Theorem 1.

Choice of $\delta$. Now we are ready to state the upper bound on the risk of our estimator
(21) under Assumptions A, B, K0-K2. Define

$$C_{\mathcal{K}} := M(\mathcal{K}) L R.$$



Theorem 2 Let Assumptions A, B, K0-K2 hold, and assume that there exists $a > 0$ such
that

5, :=mm{i,C- (2m+rf)/ V 2 [ ff (/C)]- 2 }> e a . (25) 

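Let $\hat F_* = \hat F(\delta_*)$ be the estimator (21) associated with the choice $\delta = \delta_*$. Then there
exists a constant $C_1 > M_0$ depending on $d$, $m$ and $\gamma$ only such that

$$E_F\|\hat F_* - F\|_\infty \le [3 + 2M(\mathcal{K})] \inf_{\theta\in\Theta}\Big\{\|B_\theta\|_\infty + C_1\varepsilon\sqrt{\ln \varepsilon^{-1}}\,\sup_x \sigma_\theta(x)\Big\}. \tag{26}$$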

Remark 10 Typically in nonparametric setups $L \sim \varepsilon^{-a_1}$, $\sigma(\mathcal{K}) \sim \varepsilon^{-a_2}$ for some $a_1, a_2 > 0$.
If $N$ grows not faster than $\varepsilon^{-a_3}$, then (25) holds.



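$L_\infty$-norm oracle inequality. Finally we show how the $L_\infty$-norm oracle inequality (3)
can be obtained from Theorem 2.

Theorem 3 Assume that there exists a constant $C_2 > 0$ such that

$$\inf_{\theta\in\Theta} E\|Z_\theta(\cdot)\|_\infty \ge C_2\sqrt{\ln(1/\varepsilon)}, \tag{27}$$

and let $\hat F_*$ be the estimator from Theorem 2. Then

$$\mathcal{R}_\infty[\hat F_*; F] \le C \inf_{\theta\in\Theta} \mathcal{R}_\infty[\hat F_\theta; F],$$

where $C = [3 + 2M(\mathcal{K})]\max\{1, C_1/C_2\}$.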
6»ee 
Remark 11 The condition (27) seems to be necessary in order to have the constant $C$
independent of $\varepsilon$. In fact, (27) is an assumption on the collection of kernels $\mathcal{K}$. To verify
this condition one can use the Sudakov lower bound on the expectation of the maximum of
a Gaussian process [see, e.g., Lifshits (1995, Section 14)].

Theorem 3 is an immediate consequence of Theorem 2, (27), and the
following auxiliary result that is interesting in its own right.

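Lemma 2 Let $\hat F(\cdot) = \int S(t,\cdot)Y(dt)$ be a linear estimator of $F(\cdot)$. Denote by $B_S(\cdot)$ and
$\varepsilon Z_S(\cdot)$ the bias and the stochastic part of $\hat F(\cdot) - F(\cdot)$ respectively. Then for any $F \in L_p(\mathcal{D}) \cap L_2(\mathcal{D})$
and $p \in [1, \infty]$

$$\tfrac{1}{3}\big\{\|B_S\|_p + \varepsilon E\|Z_S\|_p\big\} \le \mathcal{R}_p[\hat F; F] \le \|B_S\|_p + \varepsilon E\|Z_S\|_p. \tag{28}$$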

3 Adaptive estimation in additive multi-index model

In this section we apply the key oracle inequality of Theorem 2 to adaptive estimation in
the additive multi-index model.



3.1 Problem formulation 

We impose the following structural assumption on the function $F$ in the model (1).
Let $\mathcal{I}$ denote the set of all partitions of $(1, \ldots, d)$, and for $\eta > 0$ let

$$\mathcal{E}_\eta = \{E = (e_1, \ldots, e_d) : e_i \in \mathbb{S}^{d-1},\ |\det(E)| \ge \eta\}.$$

For any $I \in \mathcal{I}$ and $E \in \mathcal{E}_\eta$ let $E_1, \ldots, E_{|I|}$ be the corresponding partition of columns of $E$.

(F) Let $I = (I_1, \ldots, I_{|I|}) \in \mathcal{I}$, and $E \in \mathcal{E}_\eta$. There exist functions $f_i : \mathbb{R}^{|I_i|} \to \mathbb{R}$, $i = 1, \ldots, |I|$, such that

$$F(t) = \sum_{i=1}^{|I|} f_i(E_i^T t).$$

Assumption F states that the unknown function $F$ can be represented as a sum of $|I|$
unknown functions $f_i$, $i = 1, \ldots, |I|$, where $f_i$ is $|I_i|$-dimensional after an unknown linear
transformation. Note that the partition $I$ is also unknown. The assumption that $|\det(E)| \ge \eta$ is
made for technical reasons; note that our estimation procedure does not require knowledge
of the value of this parameter.

Later on the functions $f_i$ are supposed to be smooth; in particular, we will
assume that all $f_i$'s belong to an isotropic Hölder ball (see the next definition).

Definition 1 A function $f : T \to \mathbb{R}$, $T \subset \mathbb{R}^s$, is said to belong to the Hölder ball $\mathbb{H}_s(\beta, L)$
if $f$ has continuous partial derivatives of all orders $\le l$ satisfying the Hölder condition with
exponent $\alpha \in (0, 1]$:

$$\|D^k f\|_\infty \le L, \quad \forall |k| = 0, \ldots, l;$$

$$\Big| f(z) - \sum_{j=0}^{l} \frac{1}{j!} \sum_{|k|=j} D^k f(t)(z-t)^k \Big| \le L|z-t|^{\beta}, \quad \forall z, t \in T,$$

where $\beta = l + \alpha$, $k = (k_1, \ldots, k_s)$ is the multi-index, $k_i \ge 0$, $|k| = k_1 + \cdots + k_s$, $t^k = t_1^{k_1} \cdots t_s^{k_s}$
for $t = (t_1, \ldots, t_s)$, and $D^k = \partial^{|k|}/\partial t_1^{k_1} \cdots \partial t_s^{k_s}$.
The described structure includes models (i)-(iv).

1. [Single-index model.] Let $F(t) = f(e^T t)$ for some unknown $f : \mathbb{R}^1 \to \mathbb{R}^1$ and $e \in \mathbb{S}^{d-1}$.
In order to express the single-index model in terms of Assumption F, we set
$E = (e_1, \ldots, e_d)$ with $e_1, e_2, \ldots, e_d$ being an orthogonal basis of $\mathbb{R}^d$ such that $e_1 = e$.
In this case we can set $I = (I_1, I_2)$ with $I_1 = \{1\}$, $I_2 = \{2, \ldots, d\}$ and $f_1 = f$, $f_2 \equiv 0$.

2. [Additive model.] Let $F(t) = \sum_{i=1}^{d} f_i(t_i)$ for unknown $f_i : \mathbb{R}^1 \to \mathbb{R}^1$. Here $E$ is the
$d \times d$ identity matrix, and $I = (I_1, \ldots, I_d)$, $I_i = \{i\}$.

3. [Projection pursuit model.] Let $F(t) = \sum_{i=1}^{d} f_i(e_i^T t)$ for unknown $f_i : \mathbb{R}^1 \to \mathbb{R}^1$
and unknown linearly independent direction vectors $e_1, \ldots, e_d \in \mathbb{S}^{d-1}$. Here $E = (e_1, \ldots, e_d)$,
$I = (I_1, \ldots, I_d)$, $I_i = \{i\}$.

4. [Multi-index model.] Let $F(t) = f(e_1^T t, \ldots, e_m^T t)$ for unknown direction vectors
$e_1, \ldots, e_m \in \mathbb{S}^{d-1}$, and unknown function $f : \mathbb{R}^m \to \mathbb{R}^1$. We define $E = (e_1, \ldots, e_d)$,
where $(e_{m+1}, \ldots, e_d)$ is the orthogonal basis of the orthogonal complement to the
subspace $\mathrm{span}\{e_1, \ldots, e_m\}$. In this case we set $I = (I_1, I_2)$, $I_1 = (1, \ldots, m)$, $I_2 = (m+1, \ldots, d)$,
and $f_1 = f$, $f_2 \equiv 0$.
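
The way Assumption F encodes these models can be illustrated with a minimal sketch. The function name `additive_multi_index`, the row-list representation of $E$ and the particular components below are hypothetical choices made for illustration; they are not part of the paper.

```python
import math

def additive_multi_index(t, E, partition, components):
    """Evaluate F(t) = sum_i f_i(E_i^T t).

    E is a d x d matrix given as a list of rows; its columns are the
    direction vectors e_1, ..., e_d. partition is a list of index blocks
    I_1, ..., I_|I| (0-based column indices), and components[i] is f_i,
    a function of a tuple with |I_i| coordinates.
    """
    d = len(t)
    # s = E^T t, i.e. s_j = <e_j, t> for each column e_j of E
    s = [sum(E[r][j] * t[r] for r in range(d)) for j in range(d)]
    return sum(f(tuple(s[j] for j in block))
               for block, f in zip(partition, components))

# Single-index model F(t) = f(e^T t) in d = 2 with e = (cos a, sin a)
a = math.pi / 6
e1 = (math.cos(a), math.sin(a))
e2 = (-math.sin(a), math.cos(a))  # completes e1 to an orthonormal basis
E = [[e1[0], e2[0]], [e1[1], e2[1]]]
f = lambda z: math.sin(3.0 * z[0])   # illustrative link function
zero = lambda z: 0.0                 # f_2 is identically zero
single_index = lambda t: additive_multi_index(t, E, [[0], [1]], [f, zero])

# Additive model F(t) = f_1(t_1) + f_2(t_2): E is the identity, I_i = {i}
identity = [[1.0, 0.0], [0.0, 1.0]]
additive = lambda t: additive_multi_index(
    t, identity, [[0], [1]], [lambda z: z[0] ** 2, lambda z: math.cos(z[0])])
```

Only the pair (partition, $E$) changes between the models; the evaluation rule is the same in all four cases.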

Definition 2 We say that a function $F$ belongs to the class $\mathbb{F}_{I,E}(\beta, L)$, $\beta > 0$, $L > 0$, if
(i) Assumption F is fulfilled with partition $I = (I_1, \ldots, I_{|I|}) \in \mathcal{I}$ and matrix $E \in \mathcal{E}_\eta$;
(ii) there exist positive real numbers $\beta_i$ and $L$ such that $f_i \in \mathbb{H}_{|I_i|}(\beta_i, L)$, $i = 1, \ldots, |I|$;
(iii) for all $i = 1, \ldots, |I|$

$$\frac{\beta_i}{|I_i|} = \beta. \tag{29}$$

Remark 12 The meaning of condition (iii) is that smoothness of functions $f_i$ is related to
their dimensionality in such a way that the effective smoothness of all functional components
is the same. This condition does not restrict generality as smoothness of a sum of functions
is determined by the worst smoothness of summands.

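Let $\hat F$ be an estimator of $F \in \mathbb{F}_{I,E}(\beta, L)$; accuracy of $\hat F$ is measured by the maximal
risk

$$\mathcal{R}_\infty[\hat F; \mathbb{F}_{I,E}(\beta, L)] := \sup_{F \in \mathbb{F}_{I,E}(\beta,L)} E_F\|\hat F - F\|_\infty.$$

Proposition 1 (Minimax lower bound) Let $\varphi_\varepsilon(\beta) = [\varepsilon\sqrt{\ln(1/\varepsilon)}]^{2\beta/(2\beta+1)}$. Then

$$\liminf_{\varepsilon\to 0}\ \inf_{\hat F}\ \varphi_\varepsilon^{-1}(\beta)\,\mathcal{R}_\infty[\hat F; \mathbb{F}_{I,E}(\beta, L)] > 0, \quad \forall I \in \mathcal{I},\ E \in \mathcal{E}_\eta,$$

where inf is taken over all possible estimators $\hat F$.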

Remark 13 The appearance of the univariate rate $\varphi_\varepsilon(\beta)$ in the lower bound is not surprising
since $2\beta/(2\beta + 1) = 2\beta_i/(2\beta_i + |I_i|)$, $i = 1, \ldots, |I|$, in view of (29). It is worth
mentioning that $\varphi_\varepsilon(\beta) = \psi_{\varepsilon,|I_i|}(\beta_i)$ is the minimax rate of convergence in estimation of
each component fi [cf. ([3p/. 
The proof of Proposition 1 is standard and is omitted. Obviously, the accuracy
of estimation under the additive multi-index model cannot be better than the accuracy
of estimation of one component provided that all other components are identically zero.
Since $E$ is fixed, the problem is reduced to estimating an $|I_i|$-variate function of smoothness
$\beta_i$ in the model (1). In this case the lower bound is well known and given by $\psi_{\varepsilon,|I_i|}(\beta_i)$. It
remains to note that $\psi_{\varepsilon,|I_i|}(\beta_i)$ does not depend on $i$ and coincides with $\varphi_\varepsilon(\beta)$ in view of
(29).
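
The coincidence of the component-wise rates can be checked numerically. The sketch below assumes the usual form $\psi_{\varepsilon,s}(\beta) = [\varepsilon\sqrt{\ln(1/\varepsilon)}]^{2\beta/(2\beta+s)}$ for the sup-norm rate of an $s$-variate function; this form, the function names and the numerical values are assumptions made for illustration.

```python
import math

def psi(eps, beta, s):
    """Assumed sup-norm rate [eps*sqrt(ln(1/eps))]^(2*beta/(2*beta+s))
    for an s-variate function of smoothness beta."""
    return (eps * math.sqrt(math.log(1.0 / eps))) ** (2.0 * beta / (2.0 * beta + s))

def phi(eps, beta):
    """Univariate rate phi_eps(beta) from Proposition 1."""
    return psi(eps, beta, 1)

eps, beta = 1e-3, 0.75
block_sizes = [1, 2, 3]                   # |I_1|, |I_2|, |I_3|, so d = 6
betas = [beta * s for s in block_sizes]   # condition (29): beta_i / |I_i| = beta
rates = [psi(eps, b, s) for b, s in zip(betas, block_sizes)]
```

Under condition (29) every entry of `rates` equals `phi(eps, beta)`, which is why the lower bound does not depend on the component that is singled out.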

Below we propose an estimator that attains the rate $\varphi_\varepsilon(\beta)$ simultaneously over $\mathbb{F}_{I,E}(\beta, L)$,
$I \in \mathcal{I}$, $E \in \mathcal{E}_\eta$, $0 < \beta \le \beta_{\max} < \infty$, $L > 0$, i.e., the optimally adaptive estimator.

3.2 Kernel construction 

To construct a family of kernel estimators let us consider the idealized situation when both
the partition $I = (I_1, \ldots, I_{|I|}) \in \mathcal{I}$ and $E \in \mathcal{E}_\eta$ are known.

(G) Let $g : [-1/2, 1/2] \to \mathbb{R}$ be a univariate kernel satisfying the following conditions

(i) $\int g(x)\,dx = 1$, $\int g(x)x^k\,dx = 0$, $k = 1, \ldots, \ell$;

(ii) $g \in C^1$.

Fix a bandwidth $h = (h_1, \ldots, h_d)$, $h_{\min} \le h_i \le h_{\max}$, and put

$$G_0(t) = \prod_{i=1}^{d} g(t_i),$$

$$G_{i,h}(t) = \prod_{j\in I_i} \frac{1}{h_j}\, g\Big(\frac{t_j}{h_j}\Big) \prod_{j\notin I_i} g(t_j), \quad i = 1, \ldots, |I|.$$

Now we define the kernel associated with partition $I$, matrix $E$, and bandwidth $h$. Fix
$\theta = (I, E, h) \in \Theta = \mathcal{I} \times \mathcal{E}_\eta \times [h_{\min}, h_{\max}]^d$, and let

$$K_\theta(t) = |\det(E)| \sum_{i=1}^{|I|} G_{i,h}(E^T t) - (|I| - 1)\,|\det(E)|\, G_0(E^T t). \tag{30}$$

3.3 Properties of the kernel 

First we state evident properties of the kernel $K_\theta$.
Lemma 3 For any $\theta \in \Theta$

$$\int K_\theta(t)\,dt = 1,$$
$$\|K_\theta\|_1 \le (2|I| - 1)\,\|g\|_1^d,$$
$$\|K_\theta\|_2 \le |\det(E)|^{1/2}\,\|g\|_2^d \Big(\sum_{i=1}^{|I|} \prod_{j\in I_i} h_j^{-1/2} + |I| - 1\Big). \tag{31}$$

The proof follows straightforwardly from (30).
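
The first identity of Lemma 3 can be checked numerically in dimension $d = 2$. The sketch below is illustrative only: the raised-cosine kernel `g`, the midpoint-rule integrator and the particular matrices and bandwidths are assumptions chosen for the example, not objects prescribed by the paper.

```python
import math

def g(x):
    """Raised-cosine kernel on [-1/2, 1/2]: C^1, symmetric, integrates to 1."""
    return 1.0 + math.cos(2.0 * math.pi * x) if abs(x) <= 0.5 else 0.0

def kernel(t, partition, E, h):
    """K_theta(t) from (30) in dimension d = 2, theta = (partition, E, h).

    E is a list of rows whose columns are e_1, e_2; partition lists the
    blocks I_i with 0-based indices; h = (h_1, h_2).
    """
    det = abs(E[0][0] * E[1][1] - E[0][1] * E[1][0])
    s = [E[0][j] * t[0] + E[1][j] * t[1] for j in range(2)]  # s = E^T t
    g0 = g(s[0]) * g(s[1])                                   # G_0(E^T t)
    total = 0.0
    for block in partition:                                  # G_{i,h}(E^T t)
        term = 1.0
        for j in range(2):
            term *= g(s[j] / h[j]) / h[j] if j in block else g(s[j])
        total += term
    return det * total - (len(partition) - 1) * det * g0

def integrate(f, half_width=1.0, n=200):
    """Midpoint rule over the square [-half_width, half_width]^2."""
    step = 2.0 * half_width / n
    pts = [-half_width + (k + 0.5) * step for k in range(n)]
    return sum(f((x, y)) for x in pts for y in pts) * step * step

identity = [[1.0, 0.0], [0.0, 1.0]]
mass_additive = integrate(lambda t: kernel(t, [[0], [1]], identity, (0.5, 0.25)))

# Non-orthogonal unit directions e_1 = (1, 0), e_2 = (cos 60deg, sin 60deg)
skew = [[1.0, 0.5], [0.0, math.sqrt(3.0) / 2.0]]
mass_skew = integrate(lambda t: kernel(t, [[0], [1]], skew, (0.5, 0.5)))
```

Both masses are close to one: each of the $|I|$ terms $|\det(E)|\,G_{i,h}(E^T t)$ integrates to one, and the correction term removes the $|I| - 1$ surplus copies.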

The next lemma establishes approximation properties of $K_\theta$. Put for any $x \in \mathcal{D}_0$

$$B_\theta(x) = \int K_\theta(t - x)F(t)\,dt - F(x).$$

Clearly, $B_\theta(\cdot)$ is the bias of the estimator associated with kernel $K_\theta$.
Lemma 4 Let $F \in \mathbb{F}_{I,E}(\beta, L)$, and let Assumption G hold with $\ell = \max_i \lfloor \beta_i \rfloor$. Then

\i\ 

»=1 jG/i 

Remark 14 Lemmas 3 and 4 allow us to derive an upper bound on the accuracy of estimation
on the class $\mathbb{F}_{I,E}(\beta, L)$ for given $I$ and $E$. Indeed, the typical balance equation for the
bandwidth selection takes the form

$$\varepsilon\sqrt{\ln(1/\varepsilon)}\,\|K_\theta\|_2 = \|B_\theta\|_\infty.$$

Therefore using the upper bounds in (32) and (31) we arrive at the optimal choice of bandwidth
given by $h = h^* = (h_1^*, \ldots, h_d^*)$,

h* = (-v/MViyj (MM J , jeh, i = l,...,\i\. (33) 

If $\hat F_\theta(x) = \int K_\theta(t - x)Y(dt)$ is a kernel estimator with $\theta = (I, E, h^*)$ then we have the
following upper bound on its $L_\infty$-risk:

$$\mathcal{R}_\infty[\hat F_\theta; \mathbb{F}_{I,E}(\beta, L)] \le C L^{1/(2\beta+1)} \varphi_\varepsilon(\beta), \tag{34}$$

where $C$ is an absolute constant. Thus, in view of Proposition 1, $\varphi_\varepsilon(\beta)$ is the minimax rate
of convergence on the class $\mathbb{F}_{I,E}(\beta, L)$. We stress that construction of the minimax estimator
$\hat F_\theta$ requires knowledge of all parameters of the functional class: $I$, $E$, $\beta$ and $L$.
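
The balance argument of Remark 14 can be traced on a simplified version of the bounds. In the sketch below all kernel-dependent constants are set to one and a single bandwidth is used on the $|I_i|$ coordinates of one component; both simplifications, as well as the function name and the numerical values, are assumptions made for illustration and do not reproduce the exact expression (33).

```python
import math

def balanced_bandwidth(eps, L, beta_i, s):
    """Solve L * h**beta_i = eps*sqrt(ln(1/eps)) * h**(-s/2) for h.

    Simplified balance for one component of dimension s = |I_i| with a
    common bandwidth h on its s coordinates; all kernel-dependent
    constants are set to 1 (an assumption made for illustration only).
    """
    noise = eps * math.sqrt(math.log(1.0 / eps))
    return (noise / L) ** (2.0 / (2.0 * beta_i + s))

eps, L, beta, s = 1e-4, 2.0, 1.5, 2
beta_i = beta * s                       # condition (29)
h = balanced_bandwidth(eps, L, beta_i, s)
noise = eps * math.sqrt(math.log(1.0 / eps))
bias_term = L * h ** beta_i             # simplified bias bound
stochastic_term = noise * h ** (-s / 2.0)  # simplified stochastic bound
rate = L ** (1.0 / (2.0 * beta + 1.0)) * noise ** (2.0 * beta / (2.0 * beta + 1.0))
```

At the balanced bandwidth both terms equal $L^{1/(2\beta+1)}\varphi_\varepsilon(\beta)$, which is the shape of the bound (34) up to the constant $C$.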



3.4 Optimally adaptive estimator 

Let $h_{\min} = \varepsilon^2$ and $h_{\max} = \varepsilon^{2/[(2\beta_{\max}+1)d]}$ for some $\beta_{\max} > 0$. Consider the collection of
kernels $\mathcal{K} = \{K_\theta(\cdot),\ \theta = (I, E, h) \in \Theta\}$ where $K_\theta(\cdot)$ is defined in (30). The corresponding
collection of estimators is given by

$$\mathcal{F}(\mathcal{K}) = \Big\{\hat F_\theta(x) = \int K_\theta(t - x)Y(dt),\ \theta \in \Theta\Big\}.$$

Based on the collection $\mathcal{F}(\mathcal{K})$ we define the estimator $\hat F_*$ following the selection rule (21)
with the choice of $\delta = \varepsilon^a$ where $a = 24d^3 + 12d^2$.

Theorem 4 Suppose that Assumption G holds with $\ell = \lfloor d\beta_{\max} \rfloor$. Then for any $I \in \mathcal{I}$,
$E \in \mathcal{E}_\eta$, $0 < \beta \le \beta_{\max}$, and $L > 0$

$$\limsup_{\varepsilon\to 0}\ \varphi_\varepsilon^{-1}(\beta)\,\mathcal{R}_\infty[\hat F_*; \mathbb{F}_{I,E}(\beta, L)] \le C L^{1/(2\beta+1)},$$

where $C$ depends on $d$, $\beta_{\max}$, and the kernel $g$ only.

Combining the results of Theorem 4 and Proposition 1 we obtain that the estimator
$\hat F_*$ is optimally adaptive on the scale of functional classes $\mathbb{F}_{I,E}(\beta, L)$. Thus this estimator
adjusts automatically to unknown structure as well as to unknown smoothness.

We note that traditionally any structural assumption is understood as the existence of
the structure. Mathematically, in our case it means that the underlying function belongs
to the union of classes $\mathbb{F}_{I,E}(\beta, L)$ with respect to $I \in \mathcal{I}$ and $E \in \mathcal{E}_\eta$, i.e.,

$$F \in \mathbb{F}(\beta, L) = \bigcup_{I\in\mathcal{I},\,E\in\mathcal{E}_\eta} \mathbb{F}_{I,E}(\beta, L).$$

The next theorem shows that our estimation procedure is optimally adaptive on the scale of
functional classes $\mathbb{F}(\beta, L)$, $0 < \beta \le \beta_{\max}$, $L > 0$.

Theorem 5 Suppose that Assumption G holds with $\ell = \lfloor d\beta_{\max} \rfloor$. Then for any $0 < \beta \le \beta_{\max}$
and $L > 0$

$$\limsup_{\varepsilon\to 0}\ \varphi_\varepsilon^{-1}(\beta)\,\mathcal{R}_\infty[\hat F_*; \mathbb{F}(\beta, L)] \le C L^{1/(2\beta+1)},$$

where $C$ depends on $d$, $\beta_{\max}$, and the kernel $g$ only.

Theorem 4 follows immediately from Theorem 5, since $\mathbb{F}_{I,E}(\beta, L) \subset \mathbb{F}(\beta, L)$. Proposition 1 together with Theorem 5
shows that in terms of rates of convergence there is no price to pay for adaptation with
respect to unknown structure.

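4 Proofs of Theorems 1, 2 and 5

Proof of Theorem 1. Define the random event

$$A = A_1 \cap A_2 := \Big\{\omega : \sup_{\theta\in\Theta} \|\tilde Z_\theta\|_p \le \varkappa_p\Big\} \cap \Big\{\omega : \sup_{(\theta,\nu)\in\Theta\times\Theta} \|\tilde Z_{\theta,\nu}\|_p \le \varkappa_p\Big\}.$$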

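1°. First, we observe that

$$\hat B_\theta(p)\,\mathbf{1}(A) \le \|B_\theta\|_p, \quad \forall \theta \in \Theta. \tag{35}$$

Indeed, in view of Lemma 1, on the set $A$

$$\|B_\theta\|_p \ge \sup_{\nu\in\Theta} \frac{1}{\|K_\nu\|_{1,\infty}} \Big\|\int K_\nu(t, \cdot)B_\theta(t)\,dt\Big\|_p$$
$$\ge [M(\mathcal{K})]^{-1} \sup_{\nu\in\Theta}\big(\|\hat F_{\theta,\nu} - \hat F_\nu\|_p - \varepsilon\|Z_{\theta,\nu} - Z_\nu\|_p\big)$$
$$\ge [M(\mathcal{K})]^{-1} \sup_{\nu\in\Theta}\big[\|\hat F_{\theta,\nu} - \hat F_\nu\|_p - \varkappa_p\varepsilon \sup_x \sigma_{\theta,\nu}(x)\big] = \hat B_\theta(p),$$

where we have also used the definition of $A$ and the fact that

$$\hat F_{\theta,\nu}(x) - \hat F_\nu(x) = \int K_\nu(t, x)B_\theta(t)\,dt + \varepsilon[Z_{\theta,\nu}(x) - Z_\nu(x)].$$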

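2°. Second, we note that for any $\theta, \nu \in \Theta$

$$\sup_x \sigma_{\theta,\nu}(x) = \|K_{\theta,\nu} - K_\nu\|_{2,\infty} \le \|K_{\theta,\nu}\|_{2,\infty} + \|K_\nu\|_{2,\infty}$$
$$\le \|K_\theta\|_{1,\infty}\|K_\nu\|_{2,\infty} + \|K_\nu\|_{2,\infty} \le [1 + M(\mathcal{K})]\,\|K_\nu\|_{2,\infty} = [1 + M(\mathcal{K})] \sup_x \sigma_\nu(x).$$

Here we have used the inequality $\|K_{\theta,\nu}\|_{2,\infty} \le \|K_\theta\|_{1,\infty}\|K_\nu\|_{2,\infty}$ which follows from the
Minkowski integral inequality.

The Cauchy-Schwarz inequality and (6) yield $\sigma_\nu(x) \ge (\mathrm{mes}\{\mathcal{D}\})^{-1/2}$ for all $x$ and $\nu$.
This implies without loss of generality that for any $\theta, \nu \in \Theta$

$$\sup_x \sigma_{\theta,\nu}(x) \le [1 + M(\mathcal{K})] \sup_x \sigma_\nu(x). \tag{36}$$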
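3°. Now define

$$\theta_* := \arg\inf_{\theta\in\Theta}\Big\{\|B_\theta\|_p + \varkappa_p\varepsilon \sup_x \sigma_\theta(x)\Big\},$$

and let $\hat F_* = \hat F_{\theta_*}$. We write

$$\|\hat F - F\|_p\,\mathbf{1}(A) \le \|\hat F_{\theta_*} - F\|_p\,\mathbf{1}(A) + \|\hat F_{\theta_*} - \hat F_{\hat\theta}\|_p\,\mathbf{1}(A) \tag{37}$$

and note that

$$\|\hat F_{\theta_*} - F\|_p\,\mathbf{1}(A) \le \|B_{\theta_*}\|_p + \varkappa_p\varepsilon \sup_x \sigma_{\theta_*}(x) = \inf_{\theta\in\Theta}\Big\{\|B_\theta\|_p + \varkappa_p\varepsilon \sup_x \sigma_\theta(x)\Big\}. \tag{38}$$

Furthermore,

$$\|\hat F_{\theta_*} - \hat F_{\hat\theta,\theta_*}\|_p\,\mathbf{1}(A) \le M(\mathcal{K})\hat B_{\hat\theta}(p)\,\mathbf{1}(A) + \varkappa_p\varepsilon \sup_x \sigma_{\hat\theta,\theta_*}(x)$$
$$\le M(\mathcal{K})\hat B_{\hat\theta}(p)\,\mathbf{1}(A) + [1 + M(\mathcal{K})]\,\varkappa_p\varepsilon \sup_x \sigma_{\theta_*}(x),$$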

where the first inequality follows from definition of Bg (p) ; the second inequality is a conse- 
quence of ((HJ) and (f36|) . Similarly, 



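$$\|\hat F_{\hat\theta} - \hat F_{\hat\theta,\theta_*}\|_p\,\mathbf{1}(A) \le M(\mathcal{K})\hat B_{\theta_*}(p)\,\mathbf{1}(A) + \varkappa_p\varepsilon \sup_x \sigma_{\theta_*,\hat\theta}(x)$$
$$\le M(\mathcal{K})\hat B_{\theta_*}(p)\,\mathbf{1}(A) + [1 + M(\mathcal{K})]\,\varkappa_p\varepsilon \sup_x \sigma_{\hat\theta}(x).$$

Now using (21) and (35) we obtain

$$\big[\|\hat F_{\theta_*} - \hat F_{\hat\theta,\theta_*}\|_p + \|\hat F_{\hat\theta} - \hat F_{\hat\theta,\theta_*}\|_p\big]\mathbf{1}(A)$$
$$\le [1 + M(\mathcal{K})]\Big\{[\hat B_{\hat\theta}(p) + \hat B_{\theta_*}(p)]\,\mathbf{1}(A) + \varkappa_p\varepsilon \sup_x \sigma_{\hat\theta}(x) + \varkappa_p\varepsilon \sup_x \sigma_{\theta_*}(x)\Big\}$$
$$\le 2[1 + M(\mathcal{K})]\Big\{\|B_{\theta_*}\|_p + \varkappa_p\varepsilon \sup_x \sigma_{\theta_*}(x)\Big\}.$$

Then (37) and (38) lead to

$$\|\hat F - F\|_p\,\mathbf{1}(A) \le [3 + 2M(\mathcal{K})] \inf_{\theta\in\Theta}\Big\{\|B_\theta\|_p + \varkappa_p\varepsilon \sup_x \sigma_\theta(x)\Big\}. \tag{39}$$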

060 x 

4°. In order to complete the proof it suffices to bound $\|\hat F - F\|_p\,\mathbf{1}(A^c)$. Note that by
our choice of $\varkappa_p$ (see (19)), $P(A^c) \le \delta$. Moreover

||F-F||pl(A c ) < (sup||Be||p + sup||Z e (-)||p)l(A c ) 

060 060 

< HFlUfl + M(K)\l(A c ) + a(K)C 1(A C ), 
where <r(/C) is defined in ((?]), and £ '■= sup^g |Zg(x)|. Therefore 

E||F-F||pl(A c ) < HFHoofl + M(/C)]P(A C ) +a[EC 2 ] 1/2 P 1 / 2 (A c ) 
< \\F\\ 00 [1 + M(JC)]5 + v^a[E|C| 2 ] 1/2 
where we have used (|19p . Combining this inequality with (|39|) we complete the proof. | 
Proof of Theorem 2. 1°. First we show that Assumptions A, B, and K2 imply conditions
(I) and (II) of Theorem 1.

Indeed, Assumption K2 ensures that sample paths of the processes $\{Z_\theta(x), (x,\theta) \in \mathcal{D}_0 \times \Theta\}$
and $\{Z_{\theta,\nu}(x), (x,\theta,\nu) \in \mathcal{D}_0 \times \Theta \times \Theta\}$ belong with probability one to the isotropic
Hölder spaces $\mathbb{H}_{m+d}(\tau)$ and $\mathbb{H}_{2m+d}(\tau)$ with regularity index $0 < \tau < \gamma$ (Lifshits 1995,
Section 15). Thus condition (II) is fulfilled.

Moreover, together with Assumption B this implies that for any $F \in \mathbb{H}_d(M_0)$ sample
paths of the process $\hat F_{\theta,\nu}(x) - \hat F_\nu(x)$ belong with probability one to the isotropic Hölder
space $\mathbb{H}_{2m+d}(\tau')$ on $\mathcal{D}_0 \times \Theta \times \Theta$ with some regularity index $0 < \tau' < \gamma$. This, in turn, shows
that for any $F \in \mathbb{H}_d(M_0)$ sample paths of the process

$$\sup_{\nu\in\Theta} \|\hat F_{\theta,\nu} - \hat F_\nu\|_p$$

belong to $\mathbb{H}_m(\tau')$ on $\Theta$. Then condition (I) holds in view of Assumption A and Jennrich
(1969).

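2°. It follows from Lemma 6 in the Appendix that for any $\varkappa \ge 1 + \sqrt{(2m+d)/\gamma}$

$$P\Big\{\sup_{\theta\in\Theta} \|\tilde Z_\theta(\cdot)\|_\infty \ge \varkappa\Big\} + P\Big\{\sup_{(\theta,\nu)\in\Theta\times\Theta} \|\tilde Z_{\theta,\nu}(\cdot)\|_\infty \ge \varkappa\Big\} \le N^2\,[c_1 M(\mathcal{K}) L R \varkappa]^{(2m+d)/\gamma} \exp\{-\varkappa^2/2\},$$

where $c_1$ is an absolute constant. By definition of $\varkappa$ we obtain that

$$\exp\{\varkappa^2/2\} \le N^2\,[c_1 M(\mathcal{K}) L R \varkappa]^{(2m+d)/\gamma}\,\delta^{-1} \tag{40}$$

which, in turn, implies

$$\varkappa \le \Big[2\ln\delta_*^{-1} + 4\ln N + \frac{2(2m+d)}{\gamma}\ln C_{\mathcal{K}} + \frac{2m+d}{\gamma}\ln\varkappa^2 + c_2\Big]^{1/2} \le \sqrt{c_3\ln\varepsilon^{-1}} =: \bar\varkappa, \tag{41}$$

where $c_3$ depends on $(2m+d)/\gamma$ only; here we have used (25).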

Now we bound the remainder term in (|22p . It follows from Lemma [6] that for any 
A > 1 + y/{d + m)h one has 



poo f'OO 

= / 2tP(C > t)dt < 2A + 2 / tN[c 4 LR4 d+m ^^e- t2/2 dt 

JO J A 

/>oo 

< 2A + 27V[c 4 I J R] {d+m)/7 e- A2 / 4 / t 1+ ( d+m )^e~ t2 / 4 ^. 



E|C| ; 



If we choose A = y/2x and apply (f40|) . we get 

E|C| 2 < 2v / 2x + c 5 A r ~ 1 <5* < celnC 1 - 

Using (25) and the fact that $\sigma(\mathcal{K}) \ge c_7$ we finally obtain $r(\delta_*) \le M_0[1 + M(\mathcal{K})]\varepsilon + c_8\varepsilon\sqrt{\ln\varepsilon^{-1}}$,
which yields (26). ∎
Proof of Theorem 5. 1°. In order to apply the result of Theorem 2 we have to verify
Assumption K2 for the collection of kernels defined in (30). Recall that $\theta = (I, E, h)$,
and in the notation of Assumptions A and K2, $\theta = (\theta_1, \theta_2)$, where $\theta_1 = I \in \Theta_1 = \mathcal{I}$, and
$\theta_2 = (E, h) \in \Theta_2 = \mathcal{E}_\eta \times [h_{\min}, h_{\max}]^d$.

We deduce from (30) and Assumption G(ii) that $K_\theta(t)$ is continuously differentiable in
$\theta_2$ and $t$, and

sup sup|V e2jt <M-f n d , 

where L is an absolute constant depending only on d and ||g||oo- Taking into account that 
^min = e 2 w e arrive to Assumption K2 with 

1 = Le- 6d , and 7 = 1/2. (42) 

2°. In view of (42), assumption (25) is verified.

3°. Fix $\beta$ and $L$ and assume that $F \in \mathbb{F}(\beta, L)$. By definition of the class $\mathbb{F}(\beta, L)$
there exist $I_* \in \mathcal{I}$ and $E_* \in \mathcal{E}_\eta$ such that $F \in \mathbb{F}_{I_*,E_*}(\beta, L)$. Let $h^*$ be given by (33). Then
from ([261) and (EE 



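$$E_F\|\hat F_* - F\|_\infty \le [3 + 2M(\mathcal{K})] \inf_{(I,E,h)\in\Theta}\Big\{\|B_{I,E,h}\|_\infty + C_1\varepsilon\sqrt{\ln\varepsilon^{-1}}\,\sup_x \sigma_{I,E,h}(x)\Big\}$$
$$\le [3 + 2M(\mathcal{K})]\Big\{\|B_{I_*,E_*,h^*}\|_\infty + C_1\varepsilon\sqrt{\ln\varepsilon^{-1}}\,\sup_x \sigma_{I_*,E_*,h^*}(x)\Big\}$$
$$\le 2\,[3 + 2M(\mathcal{K})]\,(C_1 \vee 1)\,C L^{1/(2\beta+1)} \varphi_\varepsilon(\beta),$$

where $C$ is the constant appearing in (34). ∎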

Appendix 

Proof of Lemma 2. Only the left-hand inequality needs to be proved. First we note
that

$$\|f\|_p = \sup\Big\{\Big|\int \phi f\Big| : \|\phi\|_q = 1\Big\}$$

(Folland 1999, p. 188). Thus we have for $p < \infty$

$$E_F\|\hat F - F\|_p = E_F\|B_S + \varepsilon Z_S\|_p = E_F \sup_{g:\,\|g\|_q \le 1} \Big|\int [B_S(x) + \varepsilon Z_S(x)]\,g(x)\,dx\Big| \ge E_F \int [B_S(x) + \varepsilon Z_S(x)]\,g_*(x)\,dx,$$

where $g_*(x) = \|B_S\|_p^{-p/q}\,|B_S(x)|^{p-1}\,\mathrm{sign}\{B_S(x)\}$. Therefore

$$E_F\|B_S + \varepsilon Z_S\|_p \ge \int B_S(x)g_*(x)\,dx + \varepsilon E\int Z_S(x)g_*(x)\,dx = \|B_S\|_p. \tag{43}$$

On the other hand, by the triangle inequality $E_F\|B_S + \varepsilon Z_S\|_p \ge \varepsilon E\|Z_S\|_p - \|B_S\|_p$. Combining
the two last inequalities we obtain $E_F\|B_S + \varepsilon Z_S\|_p \ge \tfrac{1}{2}\varepsilon E\|Z_S\|_p$, which along with
(43) yields (28).

If $p = \infty$ then for any $x_0 \in \mathcal{D}_0$ one has $E\|B_S + \varepsilon Z_S\|_\infty \ge \pm E[B_S(x_0) + \varepsilon Z_S(x_0)] = \pm B_S(x_0)$,
and therefore $E\|B_S + \varepsilon Z_S\|_\infty \ge \|B_S\|_\infty$.
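
For the reader's convenience, the last step of the argument can be written out as a short worked computation showing how the two lower bounds combine into the left-hand side of (28) with constant 1/3. The shorthand $\mathcal{R}$ below is introduced only for this illustration.

```latex
\begin{align*}
\mathcal{R} &:= E_F\|B_S + \varepsilon Z_S\|_p, \qquad
\mathcal{R} \ge \|B_S\|_p, \qquad
\mathcal{R} \ge \varepsilon E\|Z_S\|_p - \|B_S\|_p, \\
2\mathcal{R} &\ge \varepsilon E\|Z_S\|_p
\quad\Longrightarrow\quad \mathcal{R} \ge \tfrac12\,\varepsilon E\|Z_S\|_p, \\
\mathcal{R} &= \tfrac13\mathcal{R} + \tfrac23\mathcal{R}
 \ge \tfrac13\|B_S\|_p + \tfrac23\cdot\tfrac12\,\varepsilon E\|Z_S\|_p
 = \tfrac13\bigl\{\|B_S\|_p + \varepsilon E\|Z_S\|_p\bigr\}.
\end{align*}
```

The weights 1/3 and 2/3 are the ones that equalize the coefficients of the bias and stochastic terms, so no larger common constant follows from these two bounds alone.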



Proof of Lemma 4. We will use the following notation: for any vector $t \in \mathbb{R}^d$ and
partition $I = (I_1, \ldots, I_{|I|})$ we will write $t_{(i)} = (t_j, j \in I_i)$. Throughout the proof, without
loss of generality, we assume that $E$ is the $d \times d$ identity matrix.
Using the fact that $F(t) = \sum_{i=1}^{|I|} f_i(E_i^T t)$ we have

$$\int K_\theta(t - x)F(t)\,dt = \sum_{i=1}^{|I|}\sum_{j=1}^{|I|} \int G_{j,h}(t - x) f_i(t_{(i)})\,dt - (|I| - 1)\sum_{i=1}^{|I|} \int G_0(t - x) f_i(t_{(i)})\,dt.$$

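Note that for all $i = 1, \ldots, |I|$

$$\int G_0(t - x) f_i(t_{(i)})\,dt = \int \Big[\prod_{j\in I_i} g(t_j - x_j)\Big] f_i(t_{(i)})\,dt_{(i)},$$
$$\int G_{i,h}(t - x) f_i(t_{(i)})\,dt = \int \Big[\prod_{j\in I_i} \frac{1}{h_j}\, g\Big(\frac{t_j - x_j}{h_j}\Big)\Big] f_i(t_{(i)})\,dt_{(i)},$$
$$\int G_{j,h}(t - x) f_i(t_{(i)})\,dt = \int \Big[\prod_{k\in I_i} g(t_k - x_k)\Big] f_i(t_{(i)})\,dt_{(i)}, \quad j \ne i.$$

Combining these equalities we obtain

$$\int K_\theta(t - x)F(t)\,dt = \sum_{i=1}^{|I|} \int \Big[\prod_{j\in I_i} \frac{1}{h_j}\, g\Big(\frac{t_j - x_j}{h_j}\Big)\Big] f_i(t_{(i)})\,dt_{(i)},$$

and

$$B_\theta(x) = \sum_{i=1}^{|I|} \int \Big[\prod_{j\in I_i} \frac{1}{h_j}\, g\Big(\frac{t_j - x_j}{h_j}\Big)\Big] \big[f_i(t_{(i)}) - f_i(x_{(i)})\big]\,dt_{(i)}$$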

|7 ' r 1 / '* i 

= e / [n rff V 1 ] - - £ ^ e Dh n(x(i))(Hi) - ^)) fc ] *», 

where the last equality follows from the fact that 

j U^ 9 C l T^ i ) {t{i) ~ X{i))kdt{i)=0, v l fc l : l fc l = 1 '---.^ i = h---,\I\, 
see Assumption G(i). Because fi G H\j.\(f3i, Li), we obtain 

i=l jeJj J J i=i je/j 

as claimed. | 

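We quote the following result from Talagrand (1994) that is repeatedly used in the proof
of Lemma 6 below.

Lemma 5 Consider a centered Gaussian process $(X_t)_{t\in T}$. Let $\sigma^2 = \sup_{t\in T} E X_t^2$. Consider
the intrinsic semi-metric $\rho_X$ on $T$ given by $\rho_X(s,t) = [E(X_s - X_t)^2]^{1/2}$. Assume that for some
constant $A > \sigma$, some $v > 0$ and some $0 < \varepsilon_0 \le \sigma$ we have

$$\varepsilon < \varepsilon_0 \ \Rightarrow\ N(T, \rho_X, \varepsilon) \le \Big(\frac{A}{\varepsilon}\Big)^{v},$$

where $N(T, \rho_X, \varepsilon)$ is the smallest number of balls of radius $\varepsilon$ needed to cover $T$. Then for
$u \ge \sigma^2[(1 + \sqrt{v})/\varepsilon_0]$ we have

$$P\Big(\sup_{t\in T} X_t \ge u\Big) \le \Big(\frac{KAu}{\sqrt{v}\,\sigma^2}\Big)^{v}\,\bar\Phi\Big(\frac{u}{\sigma}\Big),$$

where $K$ is a universal constant, and $\bar\Phi(u) = (2\pi)^{-1/2}\int_u^\infty e^{-s^2/2}\,ds$.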



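Lemma 6 Let Assumptions A, K0 and K2 hold. Then for any $\varkappa \ge 1 + \sqrt{(d+m)/\gamma}$ one has

$$P\Big\{\sup_{\theta\in\Theta} \|\tilde Z_\theta(\cdot)\|_\infty \ge \varkappa\Big\} \le N\,[C_1 L R \varkappa]^{(d+m)/\gamma} \exp\{-\varkappa^2/2\}, \tag{44}$$

where $C_1$ is an absolute constant.

Furthermore, for any $\varkappa \ge 1 + \sqrt{(d+2m)/\gamma}$ one has

$$P\Big\{\sup_{(\theta,\nu)\in\Theta\times\Theta} \|\tilde Z_{\theta,\nu}(\cdot)\|_\infty \ge \varkappa\Big\} \le N^2\,[C_2 M(\mathcal{K}) L R \varkappa]^{(d+2m)/\gamma} \exp\{-\varkappa^2/2\}, \tag{45}$$

where $C_2$ is an absolute constant.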
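
To see how a bound of the form (44) turns a confidence level $\delta$ into a threshold, one can solve for the smallest admissible $\varkappa$ numerically. The sketch below is illustrative: the absolute constant $C_1$ is not specified in the paper, so the default `C1=1.0` is a placeholder, and the function names and parameter values are hypothetical.

```python
import math

def log_tail_bound(kappa, N, L, R, d, m, gamma, C1=1.0):
    """Logarithm of the right-hand side of (44):
    N * (C1*L*R*kappa)**((d+m)/gamma) * exp(-kappa**2/2).
    C1 is an unspecified absolute constant; 1.0 is a placeholder."""
    v = (d + m) / gamma
    return math.log(N) + v * math.log(C1 * L * R * kappa) - 0.5 * kappa ** 2

def threshold(delta, N, L, R, d, m, gamma, C1=1.0, tol=1e-10):
    """Smallest kappa >= 1 + sqrt((d+m)/gamma) whose bound is <= delta."""
    lo = 1.0 + math.sqrt((d + m) / gamma)
    target = math.log(delta)
    if log_tail_bound(lo, N, L, R, d, m, gamma, C1) <= target:
        return lo
    hi = 2.0 * lo
    while log_tail_bound(hi, N, L, R, d, m, gamma, C1) > target:
        hi *= 2.0
    # The bound decreases in kappa once kappa**2 > (d+m)/gamma, so bisect.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if log_tail_bound(mid, N, L, R, d, m, gamma, C1) > target:
            lo = mid
        else:
            hi = mid
    return hi

kappa = threshold(delta=1e-3, N=10, L=50.0, R=2.0, d=2, m=3, gamma=0.5)
```

Working with the logarithm of the bound avoids overflow when $(d+m)/\gamma$ is large. The resulting threshold grows like the square root of $\ln(N/\delta)$ plus logarithmic terms in $L$ and $R$, in line with (41).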

Proof: 1°. First we prove (44). Recall our notation:

$$Z_\theta(x) = \int K_\theta(t, x)W(dt), \quad \sigma_\theta(x) = \|K_\theta(\cdot, x)\|_2, \quad \tilde Z_\theta(x) = \sigma_\theta^{-1}(x)Z_\theta(x).$$

By Assumption A, $\theta = (\theta_1, \theta_2) \in \Theta_1 \times \Theta_2$. Because the set $\Theta_1$ is finite, throughout the
proof we keep $\theta_1 \in \Theta_1$ fixed. For brevity, we will write $\theta = (\theta_1, \theta_2)$, $\theta' = (\theta_1, \theta_2')$, $u = (x, \theta_2)$,
$u' = (x', \theta_2')$. Also, with a slight abuse of notation, we write $Z(u)$, $\tilde Z(u)$ and $\sigma(u)$ for $Z_\theta(x)$,
$\tilde Z_\theta(x)$ and $\sigma_\theta(x)$ respectively. The same notation with $u$ replaced by $u'$ will be used for the
corresponding quantities depending on $u'$.

Consider the random process $\{Z(u), u \in U\}$. Clearly, it has zero mean and variance
$EZ^2(u) = \sigma^2(u)$. Let $\rho_Z$ denote the intrinsic semi-metric of $\{Z(u), u \in U\}$; then

$$\rho_Z(u, u') := [E|Z(u) - Z(u')|^2]^{1/2} = \|K_{(\theta_1,\theta_2)}(\cdot, x) - K_{(\theta_1,\theta_2')}(\cdot, x')\|_2 \le L|u - u'|^{\gamma},$$

where the last inequality follows from Assumption K2.

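Now consider the random process $\{\tilde Z(u), u \in U\}$. Let $\underline{\sigma} = \inf_{u\in U} \sigma(u)$; then

$$\rho_{\tilde Z}(u, u') := [E|\tilde Z(u) - \tilde Z(u')|^2]^{1/2} = \Big[E\Big|\frac{Z(u)}{\sigma(u)} - \frac{Z(u')}{\sigma(u')}\Big|^2\Big]^{1/2}$$
$$\le \frac{1}{\sigma(u)}\,\rho_Z(u, u') + \sigma(u')\Big|\frac{1}{\sigma(u)} - \frac{1}{\sigma(u')}\Big|$$
$$\le \underline{\sigma}^{-1}\big[\rho_Z(u, u') + |\sigma(u) - \sigma(u')|\big]$$
$$\le 2\underline{\sigma}^{-1}\rho_Z(u, u') \le 2(\mathrm{mes}\{\mathcal{D}\})^{1/2} L|u - u'|^{\gamma}. \tag{46}$$

Here we have taken into account that $\underline{\sigma} \ge (\mathrm{mes}\{\mathcal{D}\})^{-1/2}$, and

$$|\sigma(u) - \sigma(u')| = \big|\,\|K_\theta(\cdot, x)\|_2 - \|K_{\theta'}(\cdot, x')\|_2\,\big| \le \|K_\theta(\cdot, x) - K_{\theta'}(\cdot, x')\|_2 = \rho_Z(u, u').$$

It follows from (46) that the covering number $N(U, \rho_{\tilde Z}, \eta)$ of the index set $U = \mathcal{D}_0 \times \Theta_2$
with respect to the intrinsic semi-metric $\rho_{\tilde Z}$ does not exceed $[c_1 L R \eta^{-1}]^{(d+m)/\gamma}$, where $c_1$
is an absolute constant. Then using the exponential inequality of Lemma 5 [with $v = (d+m)/\gamma$,
$A = c_1 L R$ and $\sigma = \varepsilon_0 = 1$], and summing over all $\theta_1 \in \Theta_1$ we obtain (44).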

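2°. Now we turn to the proof of (45). We recall that

$$Z_{\theta,\nu}(x) - Z_\nu(x) = \int [K_{\theta,\nu}(t, x) - K_\nu(t, x)]W(dt),$$
$$\sigma_{\theta,\nu}(x) = \|K_{\theta,\nu}(\cdot, x) - K_\nu(\cdot, x)\|_2,$$

where $K_{\theta,\nu}(\cdot,\cdot)$ is defined in (10). We keep $\theta_1, \nu_1 \in \Theta_1$ fixed, and denote $\theta = (\theta_1, \theta_2)$,
$\theta' = (\theta_1, \theta_2')$, $\nu = (\nu_1, \nu_2)$, $\nu' = (\nu_1, \nu_2')$. We also denote $V = \mathcal{D}_0 \times \Theta_2 \times \Theta_2$, $v = (\theta, \nu, x)$,
$v' = (\theta', \nu', x')$, and consider the Gaussian random processes $\{\zeta(v), v \in V\}$ and $\{\tilde\zeta(v), v \in V\}$,
where

$$\zeta(v) = Z_{\theta,\nu}(x) - Z_\nu(x), \quad \tilde\zeta(v) = \sigma_{\theta,\nu}^{-1}(x)[Z_{\theta,\nu}(x) - Z_\nu(x)].$$

Let $\rho_\zeta$ and $\rho_{\tilde\zeta}$ be the intrinsic semi-metrics of these processes. Similarly to (46), it is
straightforward to show that $\rho_{\tilde\zeta}(v, v') \le 2\rho_\zeta(v, v')$, and our current goal is to bound $\rho_\zeta(v, v')$
from above.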
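We have

$$\rho_\zeta(v, v') = [E|\zeta(v) - \zeta(v')|^2]^{1/2} = \|K_{\theta,\nu}(\cdot, x) - K_\nu(\cdot, x) - K_{\theta',\nu'}(\cdot, x') + K_{\nu'}(\cdot, x')\|_2$$
$$\le \|K_\nu(\cdot, x) - K_{\nu'}(\cdot, x')\|_2 + \|K_{\theta,\nu}(\cdot, x) - K_{\theta',\nu'}(\cdot, x')\|_2 =: J_1 + J_2.$$

By Assumption K2

$$J_1 \le L|v - v'|^{\gamma}.$$

Let $\hat g(\cdot, x)$ be the Fourier transform of a function $g : \mathcal{D} \times \mathcal{D}_0 \to \mathbb{R}^1$ with respect to the first
argument, i.e.,

$$\hat g(\omega, x) = \int g(t, x)\exp\{2\pi i\,\omega^T t\}\,dt, \quad \forall x \in \mathcal{D}_0.$$

Then, by construction, $\hat K_{\theta,\nu}(\cdot, x) = \hat K_\theta(\cdot, x)\hat K_\nu(\cdot, x)$, and

$$J_2 = \|\hat K_\theta(\cdot, x)\hat K_\nu(\cdot, x) - \hat K_{\theta'}(\cdot, x')\hat K_{\nu'}(\cdot, x')\|_2$$
$$\le \|[\hat K_\theta(\cdot, x) - \hat K_{\theta'}(\cdot, x')]\hat K_\nu(\cdot, x)\|_2 + \|[\hat K_\nu(\cdot, x) - \hat K_{\nu'}(\cdot, x')]\hat K_{\theta'}(\cdot, x')\|_2$$
$$\le \|K_\nu(\cdot, x)\|_1\,\|K_\theta(\cdot, x) - K_{\theta'}(\cdot, x')\|_2 + \|K_{\theta'}(\cdot, x')\|_1\,\|K_\nu(\cdot, x) - K_{\nu'}(\cdot, x')\|_2$$
$$\le 2M(\mathcal{K})L|v - v'|^{\gamma},$$

where we have used Assumptions K0 and K2. Combining the upper bounds for $J_1$ and $J_2$ we
get $\rho_\zeta(v, v') \le [1 + 2M(\mathcal{K})]L|v - v'|^{\gamma}$, and finally

$$\rho_{\tilde\zeta}(v, v') \le 2[1 + 2M(\mathcal{K})]L|v - v'|^{\gamma}. \tag{47}$$

It follows from (47) that the covering number $N(V, \rho_{\tilde\zeta}, \eta)$ of the index set $V = \mathcal{D}_0 \times \Theta_2 \times \Theta_2$
with respect to the intrinsic semi-metric $\rho_{\tilde\zeta}$ does not exceed $[c_2 M(\mathcal{K}) L R \eta^{-1}]^{(d+2m)/\gamma}$,
where $c_2$ is an absolute constant. Then noting that $\sup_v \mathrm{var}(\tilde\zeta(v)) \le 1$, using the exponential
inequality of Lemma 5 [with $v = (d+2m)/\gamma$, $A = c_2 M(\mathcal{K}) L R$ and $\sigma = \varepsilon_0 = 1$], and summing
over all $(\theta_1, \nu_1) \in \Theta_1 \times \Theta_1$ we obtain (45).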






References 

Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via
penalization. Probab. Theory Related Fields 113, 301-413.

Belomestny, D. and Spokoiny, V. (2004). Local likelihood modeling via stagewise
aggregation. WIAS preprint No. 1000, www.wias-berlin.de

Bertin, K. (2004). Asymptotically exact minimax estimation in sup-norm for anisotropic
Hölder balls. Bernoulli 10, 873-888.

Cavalier, L., Golubev, G. K., Picard, D. and Tsybakov, A. B. (2002). Oracle 
inequalities for inverse problems. Ann. Statist. 30, 843-874. 

Chen, H. (1991). Estimation of a projection-pursuit type regression model. Ann. Statist. 
19, 142-157. 

Devroye, L. and Lugosi, G. (2001). Combinatorial Methods in Density Estimation.
Springer, New York.

Folland, G. B. (1999). Real Analysis. Second edition. Wiley, New York. 

Goldenshluger, A. and Nemirovski, A. (1997). On spatially adaptive estimation of 
nonparametric regression. Math. Methods Statist. 6, 135-170. 

Golubev, G. K. (1992). Asymptotically minimax estimation of a regression function in
an additive model. Problems Inform. Transmission 28, 101-112.

Golubev, G. K. (2004). The method of risk envelopes in the estimation of linear
functionals. (Russian) Probl. Inf. Transm. 40, 53-65.

Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free
Theory of Nonparametric Regression. Springer, New York.

Hall, P. (1989). On projection-pursuit regression. Ann. Statist. 17, 573-588. 

Hristache, M., Juditsky, A., and Spokoiny, V. (2001a). Direct estimation of the index 
coefficient in a single-index model. Ann. Statist. 29, 595-623. 

Hristache, M., Juditsky, A., Polzehl, J., and Spokoiny, V. (2001b). Structure 
adaptive approach for dimension reduction. Ann. Statist. 29, 1537-1566. 

Huber, P. (1985). Projection pursuit. With discussion. Ann. Statist. 13, 435-525. 

Ibragimov, I. A. and Khasminskii, R. Z. (1982). Bounds for the quality of nonparametric 
estimation of regression. Theory Probab. Appl. 27, 81-94. 

Ibragimov, I. A. (2004). Estimation of multivariate regression. Theory Probab. Appl. 48, 
256-272. 

Iouditski, A., Lepski, O., and Tsybakov, A. (2006). Statistical estimation of composite 
functions. Manuscript. 

Jennrich, R. (1969). Asymptotic properties of non-linear least squares estimators. Ann. 
Math. Statist. 40, 633-643. 






Johnstone, I. M. (1998). Oracle inequalities and nonparametric function estimation.
Proceedings of the International Congress of Mathematicians, Vol. III (Berlin, 1998).
Doc. Math., Extra Vol. III, 267-278.

Kerkyacharian, G., Lepski, O. and Picard, D. (2001). Nonlinear estimation in 
anisotropic multi-index denoising. Probab. Theory Related Fields 121, 137-170. 

Lepski, O. V. and Levit, B. Y. (1999). Adaptive nonparametric estimation of smooth 
multivariate functions. Math. Methods Statist. 8, 344-370. 

Lepski, O., Mammen, E., and Spokoiny, V. (1997). Optimal spatial adaptation to 
inhomogeneous smoothness: an approach based on kernel estimators with variable 
bandwidth selectors. Ann. Statist. 25, 929-947. 

Lepski, O. V. and Spokoiny, V. G. (1997). Optimal pointwise adaptive methods in 
nonparametric estimation. Ann. Statist. 25, 2512-2546. 

Lifshits, M. (1995). Gaussian Random Functions. Kluwer Academic Publishers. 

Nemirovski, A. S. (1985). Nonparametric estimation of smooth regression functions.
Soviet J. Comput. Systems Sci. 23, no. 6, 1-11; translated from Izv. Akad. Nauk SSSR
Tekhn. Kibernet. 1985, no. 3, 50-60, 235 (Russian).

Nemirovski, A. (2000). Topics in Non-parametric Statistics. Lectures on probability
theory and statistics (Saint-Flour, 1998), 85-277, Lecture Notes in Math., 1738, Springer,
Berlin.

Nicoleris, T. and Yatracos, Y. (1997). Rates of convergence of estimators,
Kolmogorov's entropy and the dimensionality reduction principle in regression. Ann.
Statist. 25, 2493-2511.

Nussbaum, M. (1987). Nonparametric estimation of a regression function that is smooth
in a domain in $\mathbb{R}^k$. Theory Probab. Appl. 31, 108-115.

Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. 
Ann. Statist. 10, 1040-1053. 

Stone, C. J. (1985). Additive regression and other nonparametric models. Ann. Statist. 
13, 689-705. 

Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Ann. 
Probab. 22, 28-76. 

Tsybakov, A. (2003). Optimal rates of aggregation. Computational Learning Theory and
Kernel Machines. B. Schölkopf and M. Warmuth, eds. Lecture Notes in Artificial
Intelligence 2777, Springer, 303-313.






